Stochastic Distributed Learning with Gradient Quantization and Variance Reduction

Optimization and Control 2019-04-11 v1

Abstract

We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance $\omega \geq 1$ that might slow down convergence. For strongly convex functions with condition number $\kappa$ distributed among $n$ machines, we (i) give a scheme that converges in $\mathcal{O}((\kappa + \kappa \frac{\omega}{n} + \omega) \log(1/\epsilon))$ steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than $m$ components, we (ii) present novel variance reduced schemes that converge in $\mathcal{O}((\kappa + \kappa \frac{\omega}{n} + \omega + m) \log(1/\epsilon))$ steps to arbitrary accuracy $\epsilon > 0$. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.
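To make the variance parameter $\omega$ concrete, the sketch below implements one simple unbiased compressor, random-$k$ sparsification, and estimates its $\omega$ empirically. It is an illustration only and not necessarily an operator used in the paper. It assumes the convention $\mathbb{E}\|Q(x)\|^2 \leq \omega \|x\|^2$ with $\mathbb{E}[Q(x)] = x$, which is consistent with $\omega \geq 1$ in the abstract; under that convention, random-$k$ sparsification of a $d$-dimensional vector has $\omega = d/k$. The names `rand_k`, `sq_norm`, and `estimate_omega` are hypothetical helpers.

```python
import random


def rand_k(x, k, rng):
    """Keep k uniformly chosen coordinates of x, scaled by d/k; zero the rest.

    The d/k scaling makes the compressor unbiased: E[Q(x)] = x.
    """
    d = len(x)
    kept = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * v if i in kept else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm of a list of floats."""
    return sum(v * v for v in x)


def estimate_omega(x, k, trials, rng):
    """Monte Carlo estimate of E||Q(x)||^2 / ||x||^2 (assumed definition of omega)."""
    total = sum(sq_norm(rand_k(x, k, rng)) for _ in range(trials))
    return total / (trials * sq_norm(x))


rng = random.Random(0)  # fixed seed so the run is reproducible
x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 4.0, -0.5]  # d = 8

# Sending k = 2 of d = 8 coordinates: the estimate should be close to d/k = 4.
omega_hat = estimate_omega(x, k=2, trials=20000, rng=rng)
print(f"estimated omega: {omega_hat:.3f}, theoretical d/k: {len(x) / 2:.1f}")
```

The trade-off the abstract refers to is visible here: sending fewer coordinates (smaller `k`) reduces communication but increases $\omega$, which enters the stated iteration bounds through the terms $\kappa \frac{\omega}{n}$ and $\omega$.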

Cite

@article{arxiv.1904.05115,
  title  = {Stochastic Distributed Learning with Gradient Quantization and Variance Reduction},
  author = {Samuel Horváth and Dmitry Kovalev and Konstantin Mishchenko and Sebastian Stich and Peter Richtárik},
  journal= {arXiv preprint arXiv:1904.05115},
  year   = {2019}
}

Comments

10 pages, 24 pages of Appendix