Related papers: CAME: Confidence-guided Adaptive Memory Efficient …

Memory-Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter,…

Machine Learning · Computer Science 2019-09-13 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Rapidly Adapting Moment Estimation

Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. Differently from existing methods, we…

Machine Learning · Computer Science 2019-02-26 Guoqiang Zhang, Kenta Niwa, W. Bastiaan Kleijn

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces…

Machine Learning · Computer Science 2024-06-07 Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

Confident Adaptive Language Modeling

Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use…

Computation and Language · Computer Science 2022-10-26 Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as…

Machine Learning · Computer Science 2026-05-22 Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

Divergence Results and Convergence of a Variance Reduced Version of ADAM

Stochastic optimization algorithms using exponential moving averages of the past gradients, such as ADAM, RMSProp and AdaGrad, have been having great successes in many applications, especially in training deep neural networks. ADAM in…

Machine Learning · Computer Science 2026-01-30 Ruiqi Wang, Diego Klabjan

Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME…

Machine Learning · Computer Science 2024-03-25 Pengxiang Zhao, Ping Li, Yingjie Gu, Yi Zheng, Stephan Ludger Kölker, Zhefeng Wang, Xiaoming Yuan

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been…

Machine Learning · Computer Science 2024-10-29 Kushal Chakrabarti, Nikhil Chopra

CAdam: Confidence-Based Optimization for Online Learning

Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which…

Machine Learning · Computer Science 2025-06-05 Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Di Wang, Jie Jiang, Jian Li

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

With the rapid development of natural language processing technology, large-scale language models (LLM) have achieved remarkable results in a variety of tasks. However, how to effectively train these huge models and improve their…

Artificial Intelligence · Computer Science 2024-12-09 Jiajing Chen, Bingying Liu, Xiaoxuan Liao, Jia Gao, Hongye Zheng, Yue Li

Backward-Friendly Optimization: Training Large Language Models with Approximate Gradients under Memory Constraints

Full fine-tuning of Large Language Models (LLMs) is notoriously memory-intensive, primarily because conventional optimizers such as SGD or Adam assume access to exact gradients derived from cached activations. Existing solutions either…

Machine Learning · Computer Science 2025-10-28 Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

To train large models (like BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP network. On one side large batch-size optimization such as LAMB…

Machine Learning · Computer Science 2021-10-07 Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He

Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos…

Machine Learning · Computer Science 2022-11-22 Ran Tian, Ankur P. Parikh

PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial…

Machine Learning · Computer Science 2026-05-12 Yao Lu, Dengdong Fan, Shixun Zhang, Yonghong Tian

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem of Adam by analyzing its performance in a simple non-convex synthetic problem, showing that Adam's…

Machine Learning · Computer Science 2020-05-06 Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, Ping Luo

A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with…

Machine Learning · Computer Science 2025-02-12 Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on…

Machine Learning · Computer Science 2021-07-01 Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared…

Machine Learning · Computer Science 2019-04-22 Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun

FOAM: Blocked State Folding for Memory-Efficient LLM Training

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using…

Machine Learning · Computer Science 2026-05-14 Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma, Jimmy Ba