Related papers: A New Adaptive Gradient Method with Gradient Decom…

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD)…

Machine Learning · Computer Science 2020-06-24 Jinghui Chen , Dongruo Zhou , Yiqi Tang , Ziyan Yang , Yuan Cao , Quanquan Gu

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared…

Machine Learning · Computer Science 2019-04-22 Liangchen Luo , Yuanhao Xiong , Yan Liu , Xu Sun

On the Convergence of Decentralized Adaptive Gradient Methods

Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization…

Machine Learning · Computer Science 2021-09-08 Xiangyi Chen , Belhal Karimi , Weijie Zhao , Ping Li

Adaptive Gradient Method with Resilience and Momentum

Several variants of stochastic gradient descent (SGD) have been proposed to improve the learning effectiveness and efficiency when training deep neural networks, among which some recent influential attempts would like to adaptively control…

Machine Learning · Computer Science 2020-10-22 Jie Liu , Chen Lin , Chuming Li , Lu Sheng , Ming Sun , Junjie Yan , Wanli Ouyang

Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on…

Machine Learning · Computer Science 2021-03-30 Loris Nanni , Gianluca Maguolo , Alessandra Lumini

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as,…

Optimization and Control · Mathematics 2024-06-21 Steffen Dereich , Arnulf Jentzen , Adrian Riekert

AdaSGD: Bridging the gap between SGD and Adam

In the context of stochastic gradient descent(SGD) and adaptive moment estimation (Adam),researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving both convergence and…

Machine Learning · Computer Science 2020-07-01 Jiaxuan Wang , Jenna Wiens

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We…

Machine Learning · Statistics 2018-05-23 Ashia C. Wilson , Rebecca Roelofs , Mitchell Stern , Nathan Srebro , Benjamin Recht

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have…

Machine Learning · Computer Science 2024-07-12 Steffen Dereich , Robin Graeber , Arnulf Jentzen

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. \textsc{Adam}, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid…

Machine Learning · Computer Science 2022-02-10 Yizhou Wang , Yue Kang , Can Qin , Huan Wang , Yi Xu , Yulun Zhang , Yun Fu

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation,…

Machine Learning · Computer Science 2020-02-10 Boris Ginsburg , Patrice Castonguay , Oleksii Hrinchuk , Oleksii Kuchaiev , Vitaly Lavrukhin , Ryan Leary , Jason Li , Huyen Nguyen , Yang Zhang , Jonathan M. Cohen

Scaling Distributed Training with Adaptive Summation

Stochastic gradient descent (SGD) is an inherently sequential training algorithm--computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor them…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Saeed Maleki , Madan Musuvathi , Todd Mytkowicz , Olli Saarikivi , Tianju Xu , Vadim Eksarevskiy , Jaliya Ekanayake , Emad Barsoum

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been…

Machine Learning · Computer Science 2024-10-29 Kushal Chakrabarti , Nikhil Chopra

Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM

Adaptive gradient methods (AGMs) have become popular in optimizing the nonconvex problems in deep learning area. We revisit AGMs and identify that the adaptive learning rate (A-LR) used by AGMs varies significantly across the dimensions of…

Machine Learning · Computer Science 2019-09-12 Qianqian Tong , Guannan Liang , Jinbo Bi

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Stochastic gradient descent (\textsc{Sgd}) methods are the most powerful optimization tools in training machine learning and deep learning models. Moreover, acceleration (a.k.a. momentum) methods and diagonal scaling (a.k.a. adaptive…

Machine Learning · Statistics 2018-10-02 Qi Deng , Yi Cheng , Guanghui Lan

Convergence rates for the Adam optimizer

Stochastic gradient descent (SGD) optimization methods are nowadays the method of choice for the training of deep neural networks (DNNs) in artificial intelligence systems. In practically relevant training problems, usually not the plain…

Optimization and Control · Mathematics 2024-08-01 Steffen Dereich , Arnulf Jentzen

A New Accelerated Stochastic Gradient Method with Momentum

In this paper, we propose a novel accelerated stochastic gradient method with momentum, which momentum is the weighted average of previous gradients. The weights decays inverse proportionally with the iteration times. Stochastic gradient…

Machine Learning · Computer Science 2020-06-02 Liang Liu , Xiaopeng Luo

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes the popular…

Machine Learning · Computer Science 2019-03-12 Xiangyi Chen , Sijia Liu , Ruoyu Sun , Mingyi Hong

CADA: Communication-Adaptive Distributed Adam

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic…

Machine Learning · Computer Science 2021-01-01 Tianyi Chen , Ziye Guo , Yuejiao Sun , Wotao Yin

Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from…

Machine Learning · Computer Science 2025-02-12 Abulikemu Abuduweili , Changliu Liu