Related papers: Meta-learning Optimizers for Communication-Efficie…
Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in…
In recent years, distributed optimization is proven to be an effective approach to accelerate training of large scale machine learning models such as deep neural networks. With the increasing computation power of GPUs, the bottleneck of…
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with…
Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed SGD. However, a rigorous convergence…
Modern deep neural networks often require distributed training with many workers due to their large size. As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient…
Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent…
Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication…
Large-scale distributed optimization is of great importance in various applications. For data-parallel based distributed learning, the inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the…
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by…
Gradient quantization is an emerging technique in reducing communication costs in distributed learning. Existing gradient quantization algorithms often rely on engineering heuristics or empirical observations, lacking a systematic approach…
Recently, the technique of local updates is a powerful tool in centralized settings to improve communication efficiency via periodical communication. For decentralized settings, it is still unclear how to efficiently combine local updates…
When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm…
Local SGD is a promising approach to overcome the communication overhead in distributed learning by reducing the synchronization frequency among worker nodes. Despite the recent theoretical advances of local SGD in empirical risk…
To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by…
Distributed stochastic gradient descent (SGD) is essential for scaling the machine learning algorithms to a large number of computing nodes. However, the infrastructures variability such as high communication delay or random node slowdown…
We propose a communication-efficient optimally structured gradient coding scheme to jointly address straggler resilience and communication efficiency in heterogeneous distributed learning. By establishing a unified framework that…
We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks:…
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often…
Deep learning has become an indispensable part of life, such as face recognition, NLP, etc., but the training of deep model has always been a challenge, and in recent years, the complexity of training data and models has shown explosive…
The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging…