Related papers: DADAM: A Consensus-based Distributed Adaptive Grad…

GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization

This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, i.e., by means of local computation and communication, without any central coordinator. We propose the gradient…

Optimization and Control · Mathematics 2023-09-13 Guido Carnevale, Francesco Farina, Ivano Notarnicola, Giuseppe Notarstefano

On the Convergence of Decentralized Adaptive Gradient Methods

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization…

Machine Learning · Computer Science 2021-09-08 Xiangyi Chen, Belhal Karimi, Weijie Zhao, Ping Li

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma, Jimmy Ba

A Decentralized Adaptive Momentum Method for Solving a Class of Min-Max Optimization Problems

Min-max saddle point games have recently been intensely studied, due to their wide range of applications, including training Generative Adversarial Networks (GANs). However, most of the recent efforts for solving them are limited to special…

Optimization and Control · Mathematics 2021-08-10 Babak Barazandeh, Tianjian Huang, George Michailidis

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. Adam, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid…

Machine Learning · Computer Science 2022-02-10 Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, Yun Fu

CAdam: Confidence-Based Optimization for Online Learning

Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which…

Machine Learning · Computer Science 2025-06-05 Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Di Wang, Jie Jiang, Jian Li

ADMM-Tracking Gradient for Distributed Optimization over Asynchronous and Unreliable Networks

In this paper, we propose a novel distributed algorithm for consensus optimization over networks and a robust extension tailored to deal with asynchronous agents and packet losses. Indeed, to robustly achieve dynamic consensus on the…

Optimization and Control · Mathematics 2025-09-04 Guido Carnevale, Nicola Bastianello, Giuseppe Notarstefano, Ruggero Carli

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared…

Machine Learning · Computer Science 2019-04-22 Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun

Divergence Results and Convergence of a Variance Reduced Version of ADAM

Stochastic optimization algorithms using exponential moving averages of the past gradients, such as ADAM, RMSProp and AdaGrad, have been having great successes in many applications, especially in training deep neural networks. ADAM in…

Machine Learning · Computer Science 2026-01-30 Ruiqi Wang, Diego Klabjan

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been…

Machine Learning · Computer Science 2024-10-29 Kushal Chakrabarti, Nikhil Chopra

Local Convergence of Adaptive Gradient Descent Optimizers

Adaptive Moment Estimation (ADAM) is a very popular training algorithm for deep neural networks and belongs to the family of adaptive gradient descent optimizers. However, to the best of the authors' knowledge, no complete convergence analysis…

Machine Learning · Computer Science 2021-02-22 Sebastian Bock, Martin Georg Weiß

Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from…

Machine Learning · Computer Science 2025-02-12 Abulikemu Abuduweili, Changliu Liu

Scaling Distributed Training with Adaptive Summation

Stochastic gradient descent (SGD) is an inherently sequential training algorithm--computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor them…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum

Adaptive Subgradient Methods for Online AUC Maximization

Learning for maximizing AUC performance is an important research problem in Machine Learning and Artificial Intelligence. Unlike traditional batch learning methods for maximizing AUC which often suffer from poor scalability, recent years…

Machine Learning · Computer Science 2016-02-02 Yi Ding, Peilin Zhao, Steven C. H. Hoi, Yew-Soon Ong

Asynchronous Distributed ADMM for Large-Scale Optimization- Part I: Algorithm and Convergence Analysis

Aiming at solving large-scale learning problems, this paper studies distributed optimization methods based on the alternating direction method of multipliers (ADMM). By formulating the learning problem as a consensus problem, the ADMM can…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-04 Tsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, Xiangfeng Wang

CADA: Communication-Adaptive Distributed Adam

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic…

Machine Learning · Computer Science 2021-01-01 Tianyi Chen, Ziye Guo, Yuejiao Sun, Wotao Yin

On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods

Adam-type optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning. Such methods are appealing due to the capability on…

Machine Learning · Computer Science 2020-12-17 Bingxin Zhou, Xuebin Zheng, Junbin Gao

Tom: Leveraging trend of the observed gradients for faster convergence

The success of deep learning can be attributed to various factors such as increase in computational power, large datasets, deep convolutional neural networks, optimizers etc. Particularly, the choice of optimizer affects the generalization,…

Machine Learning · Computer Science 2021-09-10 Anirudh Maiya, Inumella Sricharan, Anshuman Pandey, Srinivas K. S

Double Adaptive Stochastic Gradient Optimization

Adaptive moment methods have been remarkably successful in deep learning optimization, particularly in the presence of noisy and/or sparse gradients. We further the advantages of adaptive moment techniques by proposing a family of double…

Machine Learning · Statistics 2018-11-07 Kin Gutierrez, Jin Li, Cristian Challu, Artur Dubrawski

A decreasing scaling transition scheme from Adam to SGD

Adaptive gradient algorithm (AdaGrad) and its variants, such as RMSProp, Adam, AMSGrad, etc., have been widely used in deep learning. Although these algorithms are faster in the early phase of training, their generalization performance is…

Machine Learning · Computer Science 2021-09-14 Kun Zeng, Jinlan Liu, Zhixia Jiang, Dongpo Xu