Related papers: Zero Bubble Pipeline Parallelism

Pipeline Parallelism with Controllable Memory

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the…

Machine Learning · Computer Science · 2024-11-05 · Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-07-10 · Joel Lamy-Poirier

Synergistic Tensor and Pipeline Parallelism

In the machine learning system, the hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models (LLMs) and Multimodal LLMs…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-03 · Mengshi Qi, Jiaxuan Peng, Jie Zhang, Juan Zhu, Yong Li, Huadong Ma

OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training

Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-08 · Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

FreeRide: Harvesting Bubbles in Pipeline Parallelism

The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-29 · Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima Daudjee, Sihang Liu

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-12 · Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Xinrong Zhang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Balancing Pipeline Parallelism with Vocabulary Parallelism

Pipeline parallelism is widely used to scale the training of transformer-based large language models; various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-06 · Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-23 · Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science · 2025-04-22 · Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these…

Machine Learning · Computer Science · 2024-10-28 · Houming Wu, Ling Chen, Wenjie Yu

DawnPiper: A Memory-scablable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-12 · Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-03 · Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this…

Machine Learning · Computer Science · 2025-07-01 · Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li

AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models

Pipeline parallelism is widely used to train large language models (LLMs). However, increasing heterogeneity in model architectures exacerbates pipeline bubbles, thereby reducing training efficiency. Existing approaches overlook the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-30 · Jihu Guo, Tenghui Ma, Wei Gao, Peng Sun, Jiaxing Li, Xun Chen, Yuyang Jin, Dahua Lin

TimelyFreeze: Adaptive Parameter Freezing Mechanism for Pipeline Parallelism

Pipeline parallelism enables training models that exceed single-device memory, but practical throughput remains limited by pipeline bubbles. Although parameter freezing can improve training throughput by adaptively skipping backward…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-02-09 · Seonghye Cho, Jaemin Han, Hyunjin Kim, Euisoo Jung, Jae-Gil Lee

Pipeflow: An Efficient Task-Parallel Pipeline Programming Framework using Modern C++

Pipeline is a fundamental parallel programming pattern. Mainstream pipeline programming frameworks count on data abstractions to perform pipeline scheduling. This design is convenient for data-centric pipeline applications but inefficient…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-02-03 · Cheng-Hsiang Chiu, Tsung-Wei Huang, Zizheng Guo, Yibo Lin

Memory-Efficient Pipeline-Parallel DNN Training

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this…

Machine Learning · Computer Science · 2021-07-23 · Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies – data, model, sequence, and…

Machine Learning · Computer Science · 2025-03-13 · Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their…

Machine Learning · Computer Science · 2026-02-02 · Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long

A Tabular Schedule Abstraction for Communication-Aware Evaluation of Pipeline-Parallel LLM Training

Pipeline parallelism is a key technique for distributed training of large language models because it reduces per-device parameter and activation memory. However, comparing pipeline schedules is difficult: analytical models expose structural…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-26 · Daniel Barley, Jonathan Leis, Benjamin Klenk, Holger Fröning