Related papers: Improving Adaptive Moment Optimization via Precond…

Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from…

Machine Learning · Computer Science 2025-02-12 Abulikemu Abuduweili , Changliu Liu

On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods

Adam-type optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning. Such methods are appealing due to the capability on…

Machine Learning · Computer Science 2020-12-17 Bingxin Zhou , Xuebin Zheng , Junbin Gao

AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix

Adaptive optimizers, such as Adam, have achieved remarkable success in deep learning. A key component of these optimizers is the so-called preconditioning matrix, providing enhanced gradient information and regulating the step size of each…

Machine Learning · Computer Science 2024-12-10 Yun Yue , Zhiling Ye , Jiadi Jiang , Yongchao Liu , Ke Zhang

Memory-Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter,…

Machine Learning · Computer Science 2019-09-13 Rohan Anil , Vineet Gupta , Tomer Koren , Yoram Singer

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. \textsc{Adam}, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid…

Machine Learning · Computer Science 2022-02-10 Yizhou Wang , Yue Kang , Can Qin , Huan Wang , Yi Xu , Yulun Zhang , Yun Fu

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma , Jimmy Ba

WarpAdam: A new Adam optimizer based on Meta-Learning approach

Optimal selection of optimization algorithms is crucial for training deep learning models. The Adam optimizer has gained significant attention due to its efficiency and wide applicability. However, to enhance the adaptability of optimizers…

Machine Learning · Computer Science 2024-09-09 Chengxi Pan , Junshang Chen , Jingrui Ye

Adam$^+$: A Stochastic Method with Adaptive Variance Reduction

Adam is a widely used stochastic optimization method for deep learning applications. While practitioners prefer Adam because it requires less parameter tuning, its use is problematic from a theoretical point of view since it may not…

Machine Learning · Computer Science 2020-11-25 Mingrui Liu , Wei Zhang , Francesco Orabona , Tianbao Yang

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD)…

Machine Learning · Computer Science 2020-06-24 Jinghui Chen , Dongruo Zhou , Yiqi Tang , Ziyan Yang , Yuan Cao , Quanquan Gu

Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods,…

Machine Learning · Computer Science 2025-09-01 Tatyana Matveeva , Aleksandr Katrutsa , Evgeny Frolov

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been…

Machine Learning · Computer Science 2024-10-29 Kushal Chakrabarti , Nikhil Chopra

MADA: Meta-Adaptive Optimizers through hyper-gradient Descent

Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce…

Machine Learning · Computer Science 2024-06-18 Kaan Ozkara , Can Karakus , Parameswaran Raman , Mingyi Hong , Shoham Sabach , Branislav Kveton , Volkan Cevher

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence…

Machine Learning · Computer Science 2026-05-12 Kristi Topollai , Anna Choromanska

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

Diagonal preconditioners are computationally feasible approximate to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and…

Machine Learning · Computer Science 2025-10-16 Bingbin Liu , Rachit Bansal , Depen Morwani , Nikhil Vyas , David Alvarez-Melis , Sham M. Kakade

Adaptive Optimization with Examplewise Gradients

We propose a new, more general approach to the design of stochastic gradient-based optimization methods for machine learning. In this new framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a…

Machine Learning · Computer Science 2021-12-02 Julius Kunze , James Townsend , David Barber

On Higher-order Moments in Adam

In this paper, we investigate the popular deep learning optimization routine, Adam, from the perspective of statistical moments. While Adam is an adaptive lower-order moment based (of the stochastic gradient) method, we propose an extension…

Machine Learning · Computer Science 2019-10-16 Zhanhong Jiang , Aditya Balu , Sin Yong Tan , Young M Lee , Chinmay Hegde , Soumik Sarkar

Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers

In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for its defect in generalization is that it often tends to…

Machine Learning · Computer Science 2026-03-10 Tao Shi , Liangming Chen , Long Jin , Mengchu Zhou

Local Convergence of Adaptive Gradient Descent Optimizers

Adaptive Moment Estimation (ADAM) is a very popular training algorithm for deep neural networks and belongs to the family of adaptive gradient descent optimizers. However to the best of the authors knowledge no complete convergence analysis…

Machine Learning · Computer Science 2021-02-22 Sebastian Bock , Martin Georg Weiß

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different…

Machine Learning · Computer Science 2026-05-19 Andrey Veprikov , Arman Bolatov , Aleksandr Bogdanov , Samuel Horváth , Aleksandr Beznosikov , Martin Takáč , Slavomir Hanzely

Narrowing the Focus: Learned Optimizers for Pretrained Models

In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of…

Machine Learning · Computer Science 2024-10-08 Gus Kristiansen , Mark Sandler , Andrey Zhmoginov , Nolan Miller , Anirudh Goyal , Jihwan Lee , Max Vladymyrov