Related papers: Regularized Top-$k$: A Bayesian Framework for Grad…

Novel Gradient Sparsification Algorithm via Bayesian Inference

Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate…

Machine Learning · Computer Science 2024-09-24 Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana

Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient…

Machine Learning · Computer Science 2019-11-21 Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

Adaptive Top-K in SGD for Communication-Efficient Distributed Learning

Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient solution for accelerating distributed learning. One commonly used method for gradient compression is Top-K sparsification,…

Machine Learning · Computer Science 2023-09-12 Mengzhe Ruan, Guangfeng Yan, Yuanzhang Xiao, Linqi Song, Weitao Xu

Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques

This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in…

Machine Learning · Computer Science 2025-02-12 Shruti Singh, Shantanu Kumar

rTop-k: A Statistical Estimation Approach to Distributed SGD

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent…

Machine Learning · Computer Science 2020-12-04 Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

Sparsified SGD with Memory

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders…

Machine Learning · Computer Science 2018-11-30 Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-18 Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, Xiaowen Chu

Rethinking gradient sparsification as total error minimization

Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes with $k$ as…

Machine Learning · Computer Science 2021-08-03 Atal Narayan Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis

Near-Optimal Sparse Allreduce for Distributed Deep Learning

Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Shigang Li, Torsten Hoefler

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

To train deep learning models faster, distributed training on multiple GPUs is the very popular scheme in recent years. However, the communication bandwidth is still a major bottleneck of training performance. To improve overall training…

Machine Learning · Computer Science 2022-09-20 Daegun Yoon, Sangyoon Oh

Top-$k$ Regularization for Supervised Feature Selection

Feature selection identifies subsets of informative features and reduces dimensions in the original feature space, helping provide insights into data generation or a variety of domain problems. Existing methods mainly depend on feature…

Machine Learning · Computer Science 2021-06-07 Xinxing Wu, Qiang Cheng

A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent

We study the generalization error of randomized learning algorithms -- focusing on stochastic gradient descent (SGD) -- using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our generalization bounds hold for all…

Machine Learning · Computer Science 2020-06-23 Ben London

Variance Reduction with Sparse Gradients

Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per…

Machine Learning · Computer Science 2020-01-28 Melih Elibol, Lihua Lei, Michael I. Jordan

Downlink Compression Improves TopK Sparsification

Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient…

Machine Learning · Computer Science 2022-10-03 William Zou, Hans De Sterck, Jun Liu

S-D-RSM: Stochastic Distributed Regularized Splitting Method for Large-Scale Convex Optimization Problems

This paper investigates the problems large-scale distributed composite convex optimization, with motivations from a broad range of applications, including multi-agent systems, federated learning, smart grids, wireless sensor networks,…

Optimization and Control · Mathematics 2025-12-16 Maoran Wang, Xingju Cai, Yongxin Chen

Activations and Gradients Compression for Model-Parallel Training

Large neural networks require enormous computational clusters of machines. Model-parallel training, when the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information…

Machine Learning · Computer Science 2024-03-27 Mikhail Rudakov, Aleksandr Beznosikov, Yaroslav Kholodov, Alexander Gasnikov

Sparse Spectrum Gaussian Process for Bayesian Optimization

We propose a novel sparse spectrum approximation of Gaussian process (GP) tailored for Bayesian optimization. Whilst the current sparse spectrum methods provide desired approximations for regression problems, it is observed that this…

Machine Learning · Computer Science 2020-06-09 Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, Svetha Venkatesh

Gradient Sparsification for Communication-Efficient Distributed Optimization

Modern large scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as…

Machine Learning · Computer Science 2017-10-31 Jianqiao Wangni, Jialei Wang, Ji Liu, Tong Zhang

Tuning the Scheduling of Distributed Stochastic Gradient Descent with Bayesian Optimization

We present an optimizer which uses Bayesian optimization to tune the system parameters of distributed stochastic gradient descent (SGD). Given a specific context, our goal is to quickly find efficient configurations which appropriately…

Machine Learning · Statistics 2016-12-04 Valentin Dalibard, Michael Schaarschmidt, Eiko Yoneki

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global…

Machine Learning · Statistics 2020-04-28 Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma