Related papers: Novel Gradient Sparsification Algorithm via Bayesi…

Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

Error accumulation is effective for gradient sparsification in distributed settings: initially-unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a…

Machine Learning · Computer Science 2026-02-17 Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana

Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient…

Machine Learning · Computer Science 2019-11-21 Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

Sparsified SGD with Memory

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders…

Machine Learning · Computer Science 2018-11-30 Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

Adaptive Top-K in SGD for Communication-Efficient Distributed Learning

Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient solution for accelerating distributed learning. One commonly used method for gradient compression is Top-K sparsification,…

Machine Learning · Computer Science 2023-09-12 Mengzhe Ruan, Guangfeng Yan, Yuanzhang Xiao, Linqi Song, Weitao Xu

rTop-k: A Statistical Estimation Approach to Distributed SGD

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent…

Machine Learning · Computer Science 2020-12-04 Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

Near-Optimal Sparse Allreduce for Distributed Deep Learning

Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Shigang Li, Torsten Hoefler

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-18 Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, Xiaowen Chu

Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques

This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in…

Machine Learning · Computer Science 2025-02-12 Shruti Singh, Shantanu Kumar

Variance Reduction with Sparse Gradients

Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per…

Machine Learning · Computer Science 2020-01-28 Melih Elibol, Lihua Lei, Michael I. Jordan

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

To train deep learning models faster, distributed training on multiple GPUs is the very popular scheme in recent years. However, the communication bandwidth is still a major bottleneck of training performance. To improve overall training…

Machine Learning · Computer Science 2022-09-20 Daegun Yoon, Sangyoon Oh

A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent

We study the generalization error of randomized learning algorithms -- focusing on stochastic gradient descent (SGD) -- using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our generalization bounds hold for all…

Machine Learning · Computer Science 2020-06-23 Ben London

Rethinking gradient sparsification as total error minimization

Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes with $k$ as…

Machine Learning · Computer Science 2021-08-03 Atal Narayan Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis

Top-$k$ Regularization for Supervised Feature Selection

Feature selection identifies subsets of informative features and reduces dimensions in the original feature space, helping provide insights into data generation or a variety of domain problems. Existing methods mainly depend on feature…

Machine Learning · Computer Science 2021-06-07 Xinxing Wu, Qiang Cheng

GradAug: A New Regularization Method for Deep Neural Networks

We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is utilizing randomly transformed training samples to regularize a set of sub-networks, which are originated by sampling the width of the…

Computer Vision and Pattern Recognition · Computer Science 2020-10-14 Taojiannan Yang, Sijie Zhu, Chen Chen

Sparse Spectrum Gaussian Process for Bayesian Optimization

We propose a novel sparse spectrum approximation of Gaussian process (GP) tailored for Bayesian optimization. Whilst the current sparse spectrum methods provide desired approximations for regression problems, it is observed that this…

Machine Learning · Computer Science 2020-06-09 Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, Svetha Venkatesh

AUTOSPARSE: Towards Automated Sparse Training of Deep Neural Networks

Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore the non-uniform…

Machine Learning · Computer Science 2023-04-17 Abhisek Kundu, Naveen K. Mellempudi, Dharma Teja Vooturi, Bharat Kaul, Pradeep Dubey

Stochastic Top-k ListNet

ListNet is a well-known listwise learning to rank model and has gained much attention in recent years. A particular problem of ListNet, however, is the high computation complexity in model training, mainly due to the large number of object…

Information Retrieval · Computer Science 2015-11-03 Tianyi Luo, Dong Wang, Rong Liu, Yiqiao Pan

Gradient Sparsification for Communication-Efficient Distributed Optimization

Modern large scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as…

Machine Learning · Computer Science 2017-10-31 Jianqiao Wangni, Jialei Wang, Ji Liu, Tong Zhang

Efficient Neural Network Training via Forward and Backward Propagation Sparsification

Sparse training is a natural idea to accelerate the training speed of deep neural networks and save the memory usage, especially since large modern neural networks are significantly over-parameterized. However, most of the existing methods…

Machine Learning · Computer Science 2021-11-11 Xiao Zhou, Weizhong Zhang, Zonghao Chen, Shizhe Diao, Tong Zhang

An Adaptive Empirical Bayesian Method for Sparse Deep Learning

We propose a novel adaptive empirical Bayesian method for sparse deep learning, where the sparsity is ensured via a class of self-adaptive spike-and-slab priors. The proposed method works by alternatively sampling from an adaptive…

Machine Learning · Statistics 2020-04-15 Wei Deng, Xiao Zhang, Faming Liang, Guang Lin