Related papers: Seq1F1B: Efficient Sequence-Level Pipeline Paralle…

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-10 Joel Lamy-Poirier

Zero Bubble Pipeline Parallelism

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-22 Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Pipeline Parallelism with Controllable Memory

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the…

Machine Learning · Computer Science 2024-11-05 Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading

In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences imposes significant challenges due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel…

Machine Learning · Computer Science 2025-07-02 Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You

Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism

Larger model sizes and longer sequence lengths have empowered the Large Language Model (LLM) to achieve outstanding performance across various domains. However, this progress brings significant storage capacity challenges for LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Xinyuan Lin, Chenlu Li, Zongle Huang, Chunyu Wang, Bo Xiao, Huazhong Yang, Shishi Duan, Yongpan Liu

Efficient Long Context Fine-tuning with Chunk Flow

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-14 Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the…

Machine Learning · Computer Science 2023-10-05 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo

AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models

Pipeline parallelism is widely used to train large language models (LLMs). However, increasing heterogeneity in model architectures exacerbates pipeline bubbles, thereby reducing training efficiency. Existing approaches overlook the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Jihu Guo, Tenghui Ma, Wei Gao, Peng Sun, Jiaxing Li, Xun Chen, Yuyang Jin, Dahua Lin

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass…

Computation and Language · Computer Science 2024-08-26 Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-12 Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training

Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

Balancing Pipeline Parallelism with Vocabulary Parallelism

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

Synergistic Tensor and Pipeline Parallelism

In the machine learning system, the hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models (LLMs) and Multimodal LLMs…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-03 Mengshi Qi, Jiaxuan Peng, Jie Zhang, Juan Zhu, Yong Li, Huadong Ma

Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches

Pipeline parallelism has been demonstrated to be a remarkable approach to improve throughput for training deep neural networks with billions of parameters over heterogeneous clusters. The 1F1B scheduling plan is a widely adopted strategy…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-06 Siyu Wang, Zongyan Cao, Chang Si, Lansong Diao, Jiamang Wang, Wei Lin

FFN Fusion: Rethinking Sequential Computation in Large Language Models

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of…

Machine Learning · Computer Science 2025-03-25 Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and…

Performance · Computer Science 2023-12-04 Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu