Related papers: Quantized Adam with Error Feedback

Efficient-Adam: Communication-Efficient Distributed Adam

Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity on finding $\varepsilon$-stationary points has…

Machine Learning · Computer Science 2023-08-25 Congliang Chen, Li Shen, Wei Liu, Zhi-Quan Luo

A Distributed Training Algorithm of Generative Adversarial Networks with Quantized Gradients

Training generative adversarial networks (GANs) in a distributed fashion is a promising technique because it enables efficient training on massive amounts of data in real-world applications. However, GAN is known to be difficult to…

Machine Learning · Computer Science 2020-10-27 Xiaojun Chen, Shu Yang, Li Shen, Xuanrong Pang

On the Convergence of Decentralized Adaptive Gradient Methods

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization…

Machine Learning · Computer Science 2021-09-08 Xiangyi Chen, Belhal Karimi, Weijie Zhao, Ping Li

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers.…

Machine Learning · Computer Science 2022-02-25 Yujia Wang, Lu Lin, Jinghui Chen

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD)…

Machine Learning · Computer Science 2020-06-24 Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from…

Machine Learning · Computer Science 2025-02-12 Abulikemu Abuduweili, Changliu Liu

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation,…

Machine Learning · Computer Science 2020-02-10 Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

Large-scale distributed optimization is of great importance in various applications. For data-parallel based distributed learning, the inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the…

Computer Vision and Pattern Recognition · Computer Science 2018-06-22 Jiaxiang Wu, Weidong Huang, Junzhou Huang, Tong Zhang

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. Adam, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid…

Machine Learning · Computer Science 2022-02-10 Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, Yun Fu

Fast Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization

We study distributed optimization problems over a network when the communication between the nodes is constrained, and so information that is exchanged between the nodes must be quantized. Recent advances using the distributed gradient…

Optimization and Control · Mathematics 2019-05-14 Thinh T. Doan, Siva Theja Maguluri, Justin Romberg

Quantizing data for distributed learning

We consider machine learning applications that train a model by leveraging data distributed over a trusted network, where communication constraints can create a performance bottleneck. A number of recent approaches propose to overcome this…

Machine Learning · Computer Science 2021-09-10 Osama A. Hanna, Yahya H. Ezzeldin, Christina Fragouli, Suhas Diggavi

Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on…

Machine Learning · Computer Science 2021-03-30 Loris Nanni, Gianluca Maguolo, Alessandra Lumini

Communication-Efficient Adam-Type Algorithms for Distributed Data Mining

Distributed data mining is an emerging research topic to effectively and efficiently address hard data mining tasks using big data, which are partitioned and computed on different worker nodes, instead of one centralized server.…

Machine Learning · Computer Science 2022-10-17 Wenhan Xian, Feihu Huang, Heng Huang

On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods

Adam-type optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning. Such methods are appealing due to the capability on…

Machine Learning · Computer Science 2020-12-17 Bingxin Zhou, Xuebin Zheng, Junbin Gao

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Adaptive gradient methods for stochastic optimization adjust the learning rate for each parameter locally. However, there is also a global learning rate which must be tuned in order to get the best performance. In this paper, we present a…

Machine Learning · Computer Science 2018-06-12 Hiroaki Hayashi, Jayanth Koushik, Graham Neubig

Modular Distributed Nonconvex Learning with Error Feedback

In this paper, we design a novel distributed learning algorithm using stochastic compressed communications. In detail, we pursue a modular approach, merging ADMM and a gradient-based approach, benefiting from the robustness of the former…

Optimization and Control · Mathematics 2025-07-01 Guido Carnevale, Nicola Bastianello

Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning

Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and high-dimensional models. Gradient quantization is an effective…

Machine Learning · Computer Science 2021-02-10 Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, Yonina C. Eldar

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

A standard approach in large scale machine learning is distributed stochastic gradient training, which requires the computation of aggregated stochastic gradients over multiple nodes on a network. Communication is a major bottleneck in such…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-24 Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, Ji Liu

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs…

Machine Learning · Computer Science 2025-10-14 Xuan Tang, Han Zhang, Yuan Cao, Difan Zou

Variance-based Gradient Compression for Efficient Distributed Deep Learning

Due to the substantial computational cost, training state-of-the-art deep neural networks for large-scale datasets often requires distributed training using multiple computation workers. However, by nature, workers need to frequently…

Machine Learning · Computer Science 2018-02-21 Yusuke Tsuzuku, Hiroto Imachi, Takuya Akiba