Related papers: Distributed Sparse SGD with Majority Voting

Sparse Communication for Training Deep Networks

Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with others and updates the parameters using the…

Machine Learning · Computer Science 2020-09-22 Negar Foroutan Eghlidi, Martin Jaggi

Sparsified SGD with Memory

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders…

Machine Learning · Computer Science 2018-11-30 Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

Adaptive Top-K in SGD for Communication-Efficient Distributed Learning

Distributed stochastic gradient descent (SGD) with gradient compression has become a popular communication-efficient solution for accelerating distributed learning. One commonly used method for gradient compression is Top-K sparsification,…

Machine Learning · Computer Science 2023-09-12 Mengzhe Ruan, Guangfeng Yan, Yuanzhang Xiao, Linqi Song, Weitao Xu

Understanding Top-k Sparsification in Distributed Deep Learning

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient…

Machine Learning · Computer Science 2019-11-21 Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

rTop-k: A Statistical Estimation Approach to Distributed SGD

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent…

Machine Learning · Computer Science 2020-12-04 Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques

This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in…

Machine Learning · Computer Science 2025-02-12 Shruti Singh, Shantanu Kumar

Communication-Efficient Distributed SGD with Compressed Sensing

We consider large scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and edge devices imposes a significant bottleneck for the optimization…

Optimization and Control · Mathematics 2021-12-28 Yujie Tang, Vikram Ramanathan, Junshan Zhang, Na Li

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

To train deep learning models faster, distributed training on multiple GPUs is the very popular scheme in recent years. However, the communication bandwidth is still a major bottleneck of training performance. To improve overall training…

Machine Learning · Computer Science 2022-09-20 Daegun Yoon, Sangyoon Oh

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by…

Machine Learning · Computer Science 2020-03-03 Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu

Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning

The training efficiency of complex deep learning models can be significantly improved through the use of distributed optimization. However, this process is often hindered by a large amount of communication cost between workers and a…

Machine Learning · Computer Science 2023-02-16 Chanho Park, Namyoon Lee

Downlink Compression Improves TopK Sparsification

Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient…

Machine Learning · Computer Science 2022-10-03 William Zou, Hans De Sterck, Jun Liu

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Gradient-based optimization methods implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the high communication…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-11 Xiaoge Deng, Dongsheng Li, Tao Sun, Xicheng Lu

Detached Error Feedback for Distributed SGD with Random Sparsification

The communication bottleneck has been a critical problem in large-scale distributed deep learning. In this work, we study distributed SGD with random block-wise sparsification as the gradient compressor, which is ring-allreduce compatible…

Machine Learning · Computer Science 2022-06-14 An Xu, Heng Huang

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-18 Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, Xiaowen Chu

Communication-Censored Distributed Stochastic Gradient Descent

This paper develops a communication-efficient algorithm to solve the stochastic optimization problem defined over a distributed network, aiming at reducing the burdensome communication in applications such as distributed machine…

Machine Learning · Statistics 2020-01-06 Weiyu Li, Tianyi Chen, Liping Li, Zhaoxian Wu, Qing Ling

signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-26 Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar

A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

Distributed training is an effective way to accelerate the training process of large-scale deep learning models. However, the parameter exchange and synchronization of distributed stochastic gradient descent introduce a large amount of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-16 LingFei Dai, Boyu Diao, Chao Li, Yongjun Xu

Gradient Sparification for Asynchronous Distributed Training

Modern large scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information, such as…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-25 Zijie Yan

Rethinking gradient sparsification as total error minimization

Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes with $k$ as…

Machine Learning · Computer Science 2021-08-03 Atal Narayan Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis

Accelerated Sparsified SGD with Error Feedback

A stochastic gradient method for synchronous distributed optimization is studied. For reducing communication cost, we particularly focus on utilization of compression of communicated gradients. Several work has shown that sparsified…

Optimization and Control · Mathematics 2020-06-22 Tomoya Murata, Taiji Suzuki