Related papers: Memory-Efficient Adaptive Optimization

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and…

Machine Learning · Computer Science 2021-08-27 Aaron Defazio , Samy Jelassi

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter…

Computation and Language · Computer Science 2023-08-08 Yang Luo , Xiaozhe Ren , Zangwei Zheng , Zhuo Jiang , Xin Jiang , Yang You

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based…

Machine Learning · Computer Science 2025-02-12 Son Nguyen , Bo Liu , Lizhang Chen , Qiang Liu

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma , Jimmy Ba

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

With the rapid development of natural language processing technology, large-scale language models (LLM) have achieved remarkable results in a variety of tasks. However, how to effectively train these huge models and improve their…

Artificial Intelligence · Computer Science 2024-12-09 Jiajing Chen , Bingying Liu , Xiaoxuan Liao , Jia Gao , Hongye Zheng , Yue Li

On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods

Adam-type optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning. Such methods are appealing due to the capability on…

Machine Learning · Computer Science 2020-12-17 Bingxin Zhou , Xuebin Zheng , Junbin Gao

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces…

Machine Learning · Computer Science 2024-06-07 Kai Lv , Hang Yan , Qipeng Guo , Haijun Lv , Xipeng Qiu

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared…

Machine Learning · Computer Science 2019-04-22 Liangchen Luo , Yuanhao Xiong , Yan Liu , Xu Sun

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter…

Machine Learning · Computer Science 2018-04-13 Noam Shazeer , Mitchell Stern

LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy…

Machine Learning · Computer Science 2025-03-04 Thomas Robert , Mher Safaryan , Ionut-Vlad Modoranu , Dan Alistarh

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We…

Machine Learning · Statistics 2018-05-23 Ashia C. Wilson , Rebecca Roelofs , Mitchell Stern , Nathan Srebro , Benjamin Recht

MADA: Meta-Adaptive Optimizers through hyper-gradient Descent

Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce…

Machine Learning · Computer Science 2024-06-18 Kaan Ozkara , Can Karakus , Parameswaran Raman , Mingyi Hong , Shoham Sabach , Branislav Kveton , Volkan Cevher

Optimization Methods in Deep Learning: A Comprehensive Overview

In recent years, deep learning has achieved remarkable success in various fields such as image recognition, natural language processing, and speech recognition. The effectiveness of deep learning largely depends on the optimization methods…

Machine Learning · Computer Science 2023-04-25 David Shulman

Domain-independent Dominance of Adaptive Methods

From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling…

Machine Learning · Computer Science 2020-03-18 Pedro Savarese , David McAllester , Sudarshan Babu , Michael Maire

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been…

Machine Learning · Computer Science 2024-10-29 Kushal Chakrabarti , Nikhil Chopra

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem of Adam by analyzing its performance in a simple non-convex synthetic problem, showing that Adam's…

Machine Learning · Computer Science 2020-05-06 Wenjie Li , Zhaoyang Zhang , Xinjiang Wang , Ping Luo

MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed…

Machine Learning · Computer Science 2024-11-06 Ionut-Vlad Modoranu , Mher Safaryan , Grigory Malinovsky , Eldar Kurtic , Thomas Robert , Peter Richtarik , Dan Alistarh

Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods,…

Machine Learning · Computer Science 2025-09-01 Tatyana Matveeva , Aleksandr Katrutsa , Evgeny Frolov

Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities

We provide new adaptive first-order methods for constrained convex optimization. Our main algorithms AdaACSA and AdaAGD+ are accelerated methods, which are universal in the sense that they achieve nearly-optimal convergence rates for both…

Machine Learning · Computer Science 2021-02-17 Alina Ene , Huy L. Nguyen , Adrian Vladu

Scalable Second Order Optimization for Deep Learning

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second…

Machine Learning · Computer Science 2021-03-08 Rohan Anil , Vineet Gupta , Tomer Koren , Kevin Regan , Yoram Singer