Related papers: Variance Reduction with Sparse Gradients

Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient…

Machine Learning · Computer Science 2019-11-21 Shaohuai Shi , Xiaowen Chu , Ka Chun Cheung , Simon See

Stop Wasting My Gradients: Practical SVRG

We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods. We first show that the convergence rate of these methods can be preserved under a decreasing sequence of errors…

Machine Learning · Computer Science 2016-08-06 Reza Babanezhad , Mohamed Osama Ahmed , Alim Virani , Mark Schmidt , Jakub Konečný , Scott Sallinen

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-18 Shaohuai Shi , Qiang Wang , Kaiyong Zhao , Zhenheng Tang , Yuxin Wang , Xiang Huang , Xiaowen Chu

Trading-off variance and complexity in stochastic gradient descent

Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts with respect to the convergence rate,…

Machine Learning · Statistics 2016-03-23 Vatsal Shah , Megasthenis Asteris , Anastasios Kyrillidis , Sujay Sanghavi

Variance Reduction for Distributed Stochastic Gradient Descent

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require…

Machine Learning · Computer Science 2017-04-10 Soham De , Gavin Taylor , Tom Goldstein

rTop-k: A Statistical Estimation Approach to Distributed SGD

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent…

Machine Learning · Computer Science 2020-12-04 Leighton Pate Barnes , Huseyin A. Inan , Berivan Isik , Ayfer Ozgur

Minimal Variance Sampling in Stochastic Gradient Boosting

Stochastic Gradient Boosting (SGB) is a widely used approach to regularization of boosting models based on decision trees. It was shown that, in many cases, random sampling at each iteration can lead to better generalization performance of…

Machine Learning · Statistics 2019-10-30 Bulat Ibragimov , Gleb Gusev

Variance Suppression: Balanced Training Process in Deep Learning

Stochastic gradient descent updates parameters with summation gradient computed from a random data batch. This summation will lead to unbalanced training process if the data we obtained is unbalanced. To address this issue, this paper takes…

Machine Learning · Computer Science 2019-05-22 Tao Yi , Xingxuan Wang

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure

Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for…

Machine Learning · Statistics 2017-11-16 Alberto Bietti , Julien Mairal

Stabilized Sparse Online Learning for Sparse Data

Stochastic gradient descent (SGD) is commonly used for optimization in large-scale machine learning problems. Langford et al. (2009) introduce a sparse online learning method to induce sparsity via truncated gradient. With high-dimensional…

Machine Learning · Statistics 2017-05-10 Yuting Ma , Tian Zheng

Sparsified SGD with Memory

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders…

Machine Learning · Computer Science 2018-11-30 Sebastian U. Stich , Jean-Baptiste Cordonnier , Martin Jaggi

An Analysis of Stochastic Variance Reduced Gradient for Linear Inverse Problems

Stochastic variance reduced gradient (SVRG) is a popular variance reduction technique for accelerating stochastic gradient descent (SGD). We provide a first analysis of the method for solving a class of linear inverse problems in the lens…

Numerical Analysis · Mathematics 2022-01-19 Bangti Jin , Zehui Zhou , Jun Zou

AdaBatch: Efficient Gradient Aggregation Rules for Sequential and Parallel Stochastic Gradient Methods

We study a new aggregation operator for gradients coming from a mini-batch for stochastic gradient (SG) methods that allows a significant speed-up in the case of sparse optimization problems. We call this method AdaBatch and it only…

Machine Learning · Computer Science 2017-11-07 Alexandre Défossez , Francis Bach

Adaptive Top-K in SGD for Communication-Efficient Distributed Learning

Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient solution for accelerating distributed learning. One commonly used method for gradient compression is Top-K sparsification,…

Machine Learning · Computer Science 2023-09-12 Mengzhe Ruan , Guangfeng Yan , Yuanzhang Xiao , Linqi Song , Weitao Xu

Variance Reduced Stochastic Gradient Descent with Neighbors

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness,…

Machine Learning · Computer Science 2016-02-29 Thomas Hofmann , Aurelien Lucchi , Simon Lacoste-Julien , Brian McWilliams

Novel Gradient Sparsification Algorithm via Bayesian Inference

Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate…

Machine Learning · Computer Science 2024-09-24 Ali Bereyhi , Ben Liang , Gary Boudreau , Ali Afana

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, SAGA. These algorithms have…

Machine Learning · Computer Science 2016-01-26 Sashank J. Reddi , Ahmed Hefny , Suvrit Sra , Barnabás Póczos , Alex Smola

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

To train deep learning models faster, distributed training on multiple GPUs is the very popular scheme in recent years. However, the communication bandwidth is still a major bottleneck of training performance. To improve overall training…

Machine Learning · Computer Science 2022-09-20 Daegun Yoon , Sangyoon Oh

Towards Better Generalization: BP-SVRG in Training Deep Neural Networks

Stochastic variance-reduced gradient (SVRG) is a classical optimization method. Although it is theoretically proved to have better convergence performance than stochastic gradient descent (SGD), the generalization performance of SVRG…

Machine Learning · Statistics 2019-08-20 Hao Jin , Dachao Lin , Zhihua Zhang

Accelerating Minibatch Stochastic Gradient Descent using Typicality Sampling

Machine learning, especially deep neural networks, has been rapidly developed in fields including computer vision, speech recognition and reinforcement learning. Although Mini-batch SGD is one of the most popular stochastic optimization…

Machine Learning · Computer Science 2019-03-12 Xinyu Peng , Li Li , Fei-Yue Wang