Related papers: Efficient Large-Scale Language Model Training on G…

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory…

Computation and Language · Computer Science 2020-03-17 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Optimizing Distributed Training on Frontier for Large Language Models

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings…

Machine Learning · Computer Science 2024-02-27 Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-12 Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Xinrong Zhang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), as well as Sequence Parallelism (SP) and Context Parallelism (CP), are employed…

Machine Learning · Computer Science 2024-11-12 Kazuki Fujii, Kohei Watanabe, Rio Yokota

Galvatron: Automatic Distributed Training for Large Transformer Models

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are a better choice to pretrain (on user-specified datasets) by following the scaling laws…

Machine Learning · Computer Science 2026-03-23 Praveen Rao

Characterizing Communication Patterns in Distributed Large Language Model Inference

Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-22 Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda

Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models

The advent of the transformer has sparked a quick growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, for which training…

Machine Learning · Computer Science 2021-06-08 Joel Lamy-Poirier

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Scaling Performance of Large Language Model Pretraining

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-10 Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-10 Joel Lamy-Poirier

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu, Jia Rao, Wei Chen

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism…

Machine Learning · Computer Science 2021-09-29 Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

Maximizing Parallelism in Distributed Training for Huge Neural Networks

Recent Natural Language Processing techniques have been advancing state-of-the-art performance at an incredible speed. Training huge language models is therefore in high demand in both industry and academia. However, huge…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-01 Zhengda Bian, Qifan Xu, Boxiang Wang, Yang You

Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems

With the rapid adoption of large language models (LLMs) in recommendation systems, the computational and communication bottlenecks caused by their massive parameter sizes and large data volumes have become increasingly prominent. This paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-25 Haowei Yang, Yu Tian, Zhongheng Yang, Zhao Wang, Chengrui Zhou, Dannier Li

An Efficient 2D Method for Training Super-Large Deep Learning Models

Huge neural network models have shown unprecedented performance in real-world applications. However, due to memory constraints, model parallelism must be utilized to host large models that would otherwise not fit into the memory of a single…

Machine Learning · Computer Science 2021-04-13 Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You