Related papers: Dynamic Memory Based Adaptive Optimization

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as…

Machine Learning · Computer Science 2026-05-22 Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with…

Machine Learning · Computer Science 2025-02-12 Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen

Memory Augmented Optimizers for Deep Learning

Popular approaches for minimizing loss in data-driven learning often involve an abstraction or an explicit retention of the history of gradients for efficient parameter updates. The aggregated history of gradients nudges the parameter…

Machine Learning · Computer Science 2021-06-22 Paul-Aymeric McRae, Prasanna Parthasarathi, Mahmoud Assran, Sarath Chandar

Adaptive Sequential Optimization with Applications to Machine Learning

A framework is introduced for solving a sequence of slowly changing optimization problems, including those arising in regression and classification applications, using optimization algorithms such as stochastic gradient descent (SGD). The…

Machine Learning · Computer Science 2015-09-25 Craig Wilson, Venugopal V. Veeravalli

Practical tradeoffs between memory, compute, and performance in learned optimizers

Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few hyperparameters of commonly used hand-designed optimizers, e.g. Adam or SGD, are replaced with flexible parametric…

Machine Learning · Computer Science 2022-07-19 Luke Metz, C. Daniel Freeman, James Harrison, Niru Maheswaranathan, Jascha Sohl-Dickstein

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence…

Machine Learning · Computer Science 2026-05-12 Kristi Topollai, Anna Choromanska

Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters?

NLP research has explored different neural model architectures and sizes, datasets, training objectives, and transfer learning techniques. However, the choice of optimizer during training has not been explored as extensively. Typically,…

Computation and Language · Computer Science 2024-02-13 Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion Androutsopoulos

When Can You Get Away with Low Memory Adam?

Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that…

Machine Learning · Computer Science 2025-03-19 Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein

Resetting the Optimizer in Deep RL: An Empirical Study

We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common…

Machine Learning · Computer Science 2023-11-16 Kavosh Asadi, Rasool Fakoor, Shoham Sabach

AdaCL: Adaptive Continual Learning

Class-Incremental Learning aims to update a deep classifier to learn new categories while maintaining or improving its accuracy on previously observed classes. Common methods to prevent forgetting previously learned classes include…

Machine Learning · Computer Science 2024-07-02 Elif Ceren Gok Yildirim, Murat Onur Yildirim, Mert Kilickaya, Joaquin Vanschoren

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale…

Machine Learning · Computer Science 2026-05-12 Aditya Ranganath

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow…

Machine Learning · Computer Science 2026-02-25 Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng

Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the…

Machine Learning · Computer Science 2025-02-21 Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds

Adaptive Sequential Machine Learning

A framework previously introduced in [3] for solving a sequence of stochastic optimization problems with bounded changes in the minimizers is extended and applied to machine learning problems such as regression and classification. The…

Machine Learning · Computer Science 2019-04-08 Craig Wilson, Yuheng Bu, Venugopal Veeravalli

LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy…

Machine Learning · Computer Science 2025-03-04 Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

Training With Data Dependent Dynamic Learning Rates

Recently, many first- and second-order variants of SGD have been proposed to facilitate training of Deep Neural Networks (DNNs). A common limitation of these works stems from the fact that they use the same learning rate across all instances…

Machine Learning · Computer Science 2021-05-31 Shreyas Saxena, Nidhi Vyas, Dennis DeCoste

Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate…

Machine Learning · Computer Science 2019-12-04 Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba

Dual Averaging is Surprisingly Effective for Deep Learning Optimization

First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance.…

Machine Learning · Computer Science 2020-10-21 Samy Jelassi, Aaron Defazio

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

With the rapid development of natural language processing technology, large-scale language models (LLMs) have achieved remarkable results in a variety of tasks. However, how to effectively train these huge models and improve their…

Artificial Intelligence · Computer Science 2024-12-09 Jiajing Chen, Bingying Liu, Xiaoxuan Liao, Jia Gao, Hongye Zheng, Yue Li

How Memory in Optimization Algorithms Implicitly Modifies the Loss

In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient…

Machine Learning · Computer Science 2026-01-14 Matias D. Cattaneo, Boris Shigida