Related papers: FlashOptim: Optimizers for Memory-Efficient Traini…

Memory Efficient Mixed-Precision Optimizers

Traditional optimization methods rely on the use of single-precision floating point arithmetic, which can be costly in terms of memory size and computing power. However, mixed precision optimization techniques leverage the use of both…

Machine Learning · Computer Science 2023-09-25 Basile Lewandowski, Atli Kosson

How to Fine-Tune Vision Models with SGD

SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8…

Computer Vision and Pattern Recognition · Computer Science 2023-10-11 Ananya Kumar, Ruoqi Shen, Sebastien Bubeck, Suriya Gunasekar

Memory Efficient Optimizers with 4-bit States

Optimizer states are a major source of memory consumption for training neural networks, limiting the maximum trainable model within given memory budget. Compressing the optimizer states from 32-bit floating points to lower bitwidth is…

Machine Learning · Computer Science 2023-10-30 Bingrui Li, Jianfei Chen, Jun Zhu

Memory-Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter,…

Machine Learning · Computer Science 2019-09-13 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the…

Machine Learning · Computer Science 2025-08-15 Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as…

Machine Learning · Computer Science 2026-05-22 Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

AdaPM: a Partial Momentum Algorithm for LLM Training

In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive…

Machine Learning · Computer Science 2025-10-13 Yimu Zhang, Yuanshi Liu, Cong Fang

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence…

Machine Learning · Computer Science 2026-05-12 Kristi Topollai, Anna Choromanska

Stable and low-precision training for large-scale vision-language models

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while…

Machine Learning · Computer Science 2023-10-18 Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt

LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy…

Machine Learning · Computer Science 2025-03-04 Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

An Analysis of Optimizer Choice on Energy Efficiency and Performance in Neural Network Training

As machine learning models grow increasingly complex and computationally demanding, understanding the environmental impact of training decisions becomes critical for sustainable AI development. This paper presents a comprehensive empirical…

Machine Learning · Computer Science 2025-09-18 Tom Almog

Adam-mini: Use Fewer Learning Rates To Gain More

We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the…

Machine Learning · Computer Science 2025-02-25 Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

FOAM: Blocked State Folding for Memory-Efficient LLM Training

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using…

Machine Learning · Computer Science 2026-05-14 Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun

Extreme Tensoring for Low-Memory Preconditioning

State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and…

Machine Learning · Computer Science 2019-02-14 Xinyi Chen, Naman Agarwal, Elad Hazan, Cyril Zhang, Yi Zhang

Practical tradeoffs between memory, compute, and performance in learned optimizers

Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few hyperparameters of commonly used hand-designed optimizers, e.g. Adam or SGD, are replaced with flexible parametric…

Machine Learning · Computer Science 2022-07-19 Luke Metz, C. Daniel Freeman, James Harrison, Niru Maheswaranathan, Jascha Sohl-Dickstein

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter…

Computation and Language · Computer Science 2023-08-08 Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You

APOLLO: SGD-like Memory, AdamW-level Performance

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training…

Machine Learning · Computer Science 2025-02-18 Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee

FP8-LM: Training FP8 Large Language Models

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats…

Machine Learning · Computer Science 2023-12-20 Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale…

Machine Learning · Computer Science 2026-05-12 Aditya Ranganath

Training neural networks faster with minimal tuning using pre-computed lists of hyperparameters for NAdamW

If we want to train a neural network using any of the most popular optimization algorithms, we are immediately faced with a dilemma: how to set the various optimization and regularization hyperparameters? When computational resources are…

Machine Learning · Computer Science 2025-03-07 Sourabh Medapati, Priya Kasimbeg, Shankar Krishnan, Naman Agarwal, George Dahl