Related papers: Meta-learning Optimizers for Communication-Efficie…

Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration

Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in…

Machine Learning · Computer Science 2025-12-12 Ahmed Khaled, Satyen Kale, Arthur Douillard, Chi Jin, Rob Fergus, Manzil Zaheer

Toward Communication Efficient Adaptive Gradient Method

In recent years, distributed optimization is proven to be an effective approach to accelerate training of large scale machine learning models such as deep neural networks. With the increasing computation power of GPUs, the bottleneck of…

Machine Learning · Computer Science 2021-09-14 Xiangyi Chen, Xiaoyun Li, Ping Li

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with…

Machine Learning · Computer Science 2024-05-21 Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

Cooperative SGD: A unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms

Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed SGD. However, a rigorous convergence…

Machine Learning · Computer Science 2019-01-28 Jianyu Wang, Gauri Joshi

Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

Modern deep neural networks often require distributed training with many workers due to their large size. As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient…

Machine Learning · Statistics 2024-11-07 Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent…

Optimization and Control · Mathematics 2019-05-13 Hao Yu, Rong Jin, Sen Yang

Accelerating Decentralized Optimization via Overlapping Local Steps

Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication…

Machine Learning · Computer Science 2026-01-06 Yijie Zhou, Shi Pu

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

Large-scale distributed optimization is of great importance in various applications. For data-parallel based distributed learning, the inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the…

Computer Vision and Pattern Recognition · Computer Science 2018-06-22 Jiaxiang Wu, Weidong Huang, Junzhou Huang, Tong Zhang

Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models: Extension

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by…

Machine Learning · Computer Science 2022-04-29 Yunfei Teng, Wenbo Gao, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller

DQ-SGD: Dynamic Quantization in SGD for Communication-Efficient Distributed Learning

Gradient quantization is an emerging technique in reducing communication costs in distributed learning. Existing gradient quantization algorithms often rely on engineering heuristics or empirical observations, lacking a systematic approach…

Machine Learning · Computer Science 2021-08-02 Guangfeng Yan, Shao-Lun Huang, Tian Lan, Linqi Song

Communication-Efficient Local Decentralized SGD Methods

Recently, the technique of local updates is a powerful tool in centralized settings to improve communication efficiency via periodical communication. For decentralized settings, it is still unclear how to efficiently combine local updates…

Machine Learning · Statistics 2021-04-06 Xiang Li, Wenhao Yang, Shusen Wang, Zhihua Zhang

Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm…

Machine Learning · Computer Science 2020-12-08 Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

Local Stochastic Gradient Descent Ascent: Convergence Analysis and Communication Efficiency

Local SGD is a promising approach to overcome the communication overhead in distributed learning by reducing the synchronization frequency among worker nodes. Despite the recent theoretical advances of local SGD in empirical risk…

Machine Learning · Computer Science 2021-03-01 Yuyang Deng, Mehrdad Mahdavi

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by…

Machine Learning · Computer Science 2020-03-03 Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu

Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

Distributed stochastic gradient descent (SGD) is essential for scaling the machine learning algorithms to a large number of computing nodes. However, the infrastructures variability such as high communication delay or random node slowdown…

Machine Learning · Computer Science 2020-02-25 Jianyu Wang, Hao Liang, Gauri Joshi

Communication-Efficient Approximate Gradient Coding for Distributed Learning in Heterogeneous Systems

We propose a communication-efficient optimally structured gradient coding scheme to jointly address straggler resilience and communication efficiency in heterogeneous distributed learning. By establishing a unified framework that…

Systems and Control · Electrical Eng. & Systems 2026-05-18 Heekang Song, Wan Choi

Robust and Communication-Efficient Collaborative Learning

We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks:…

Machine Learning · Computer Science 2019-11-04 Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani

Local SGD Converges Fast and Communicates Little

Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often…

Optimization and Control · Mathematics 2019-05-06 Sebastian U. Stich

Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View

Deep learning has become an indispensable part of life, such as face recognition, NLP, etc., but the training of deep model has always been a challenge, and in recent years, the complexity of training data and models has shown explosive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-18 Sheng Huang

Local Methods with Adaptivity via Scaling

The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging…

Machine Learning · Computer Science 2024-12-03 Savelii Chezhegov, Sergey Skorik, Nikolas Khachaturov, Danil Shalagin, Aram Avetisyan, Martin Takáč, Yaroslav Kholodov, Aleksandr Beznosikov