Related papers: ZeRO: Memory Optimizations Toward Training Trillio…

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-19 Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He

ZeRO-Offload: Democratizing Billion-Scale Model Training

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-19 Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scale…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-21 Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training

Training large language models (LLMs) encounters challenges in GPU memory consumption due to the high memory requirements of model states. The widely used Zero Redundancy Optimizer (ZeRO) addresses this issue through strategic sharding but…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-14 Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, Peng Sun

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage. A promising approach to mitigate this is using Zeroth-Order (ZO) optimization, which estimates gradients to replace…

Machine Learning · Computer Science 2024-10-15 Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

Fine-Tuning Language Models with Just Forward Passes

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using…

Machine Learning · Computer Science 2024-01-12 Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

An Efficient 2D Method for Training Super-Large Deep Learning Models

Huge neural network models have shown unprecedented performance in real-world applications. However, due to memory constraints, model parallelism must be utilized to host large models that would otherwise not fit into the memory of a single…

Machine Learning · Computer Science 2021-04-13 Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You

Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks

Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory…

Machine Learning · Computer Science 2024-03-15 Louis Fournier, Edouard Oyallon

Full Parameter Fine-tuning for Large Language Models with Limited Resources

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting…

Computation and Language · Computer Science 2024-06-07 Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu

ReLoRA: High-Rank Training Through Low-Rank Updates

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In…

Computation and Language · Computer Science 2023-12-12 Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky

ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during…

Machine Learning · Computer Science 2025-03-18 Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory…

Computation and Language · Computer Science 2020-03-17 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training

The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-24 Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu

Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking

Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by…

Machine Learning · Computer Science 2025-05-30 Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks

Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data parallel training systems. While techniques like ZeRO++ have enabled efficient…

Machine Learning · Computer Science 2024-10-08 Yun Dai, Tejas Dharamsi, Byron Hsu, Tao Song, Hamed Firooz

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host…

Computation and Language · Computer Science 2026-04-08 Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO)…

Machine Learning · Computer Science 2026-02-17 Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You

Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models

The advent of the transformer has sparked a quick growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, for which training…

Machine Learning · Computer Science 2021-06-08 Joel Lamy-Poirier