Related papers: ZeRO: Memory Optimizations Toward Training Trillio…

200 papers

In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-04-19 · Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-19 · Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPUs clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scale…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-06-21 · Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-10-12 · Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Training large language models (LLMs) encounters challenges in GPU memory consumption due to the high memory requirements of model states. The widely used Zero Redundancy Optimizer (ZeRO) addresses this issue through strategic sharding but…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-14 · Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, Peng Sun

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages. A promising approach to mitigate this is using Zeroth-Order (ZO) optimization, which estimates gradients to replace…

Machine Learning · Computer Science · 2024-10-15 · Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using…

Machine Learning · Computer Science · 2024-01-12 · Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

Huge neural network models have shown unprecedented performance in real-world applications. However, due to memory constraints, model parallelism must be utilized to host large models that would otherwise not fit into the memory of a single…

Machine Learning · Computer Science · 2021-04-13 · Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You

Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory…

Machine Learning · Computer Science · 2024-03-15 · Louis Fournier, Edouard Oyallon

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting…

Computation and Language · Computer Science · 2024-06-07 · Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In…

Computation and Language · Computer Science · 2023-12-12 · Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky

Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during…

Machine Learning · Computer Science · 2025-03-18 · Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory…

Computation and Language · Computer Science · 2020-03-17 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-24 · Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu

Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by…

Machine Learning · Computer Science · 2025-05-30 · Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data parallel training systems. While techniques like ZeRO++ have enabled efficient…

Machine Learning · Computer Science · 2024-10-08 · Yun Dai, Tejas Dharamsi, Byron Hsu, Tao Song, Hamed Firooz

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host…

Computation and Language · Computer Science · 2026-04-08 · Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO)…

Machine Learning · Computer Science · 2026-02-17 · Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You

The advent of the transformer has sparked a quick growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, for which training…

Machine Learning · Computer Science · 2021-06-08 · Joel Lamy-Poirier