Related papers: A Memory Efficient Randomized Subspace Optimizatio…

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as…

Machine Learning · Computer Science 2026-05-22 Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces…

Machine Learning · Computer Science 2024-06-07 Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy…

Machine Learning · Computer Science 2025-03-04 Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

With the rapid development of natural language processing technology, large-scale language models (LLM) have achieved remarkable results in a variety of tasks. However, how to effectively train these huge models and improve their…

Artificial Intelligence · Computer Science 2024-12-09 Jiajing Chen, Bingying Liu, Xiaoxuan Liao, Jia Gao, Hongye Zheng, Yue Li

Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking

Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by…

Machine Learning · Computer Science 2025-05-30 Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable…

Machine Learning · Computer Science 2024-06-04 Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale…

Machine Learning · Computer Science 2026-05-12 Aditya Ranganath

Subspace Optimization for Large Language Models with Convergence Guarantees

Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear,…

Machine Learning · Computer Science 2025-06-05 Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan

Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods…

Machine Learning · Computer Science 2025-07-25 Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Mi Tian, Hua Huang

Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching

Training efficiency in large-scale models is typically assessed through memory consumption, training time, and model performance. Current methods often exhibit trade-offs among these metrics, as optimizing one generally degrades at least…

Machine Learning · Computer Science 2026-02-03 Tianhao Miao, Zhongyuan Bao, Lejun Zhang

BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks

Training large language models (LLMs) for pretraining or adapting to new tasks and domains has become increasingly critical as their applications expand. However, as the model and the data sizes grow, the training process presents…

Machine Learning · Computer Science 2024-12-17 Amrutha Varshini Ramesh, Vignesh Ganapathiraman, Issam H. Laradji, Mark Schmidt

FOAM: Blocked State Folding for Memory-Efficient LLM Training

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using…

Machine Learning · Computer Science 2026-05-14 Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun

GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent…

Machine Learning · Computer Science 2025-04-30 DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao

Reversing Large Language Models for Efficient Training and Fine-Tuning

Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In…

Computation and Language · Computer Science 2025-12-05 Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

SubTrack++: Gradient Subspace Tracking for Scalable LLM Training

Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. While recent work has aimed to reduce memory consumption, such efforts often entail…

Machine Learning · Computer Science 2025-10-28 Sahar Rajabi, Nayeema Nonta, Sirisha Rambhatla

Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the…

Machine Learning · Computer Science 2025-12-15 Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Lin F. Yang, Tianyi Zhang, Anshumali Shrivastava, Vladimir Braverman

Automated Optimization Modeling via a Localizable Error-Driven Perspective

Automated optimization modeling via Large Language Models (LLMs) has emerged as a promising approach to assist complex human decision-making. While post-training has become a pivotal technique to enhance LLMs' capabilities in this domain,…

Machine Learning · Computer Science 2026-02-13 Weiting Liu, Han Wu, Yufei Kuang, Xiongwei Han, Tao Zhong, Jianfeng Feng, Wenlian Lu

Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide

Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective…

Computation and Language · Computer Science 2025-10-29 Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer

Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees

We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and…

Machine Learning · Computer Science 2025-05-27 Thien Hang Nguyen, Huy Le Nguyen

Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often…

Computation and Language · Computer Science 2025-06-27 Zhengyan Shi