Related papers: DynaPipe: Optimizing Multi-task Training through D…

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-03 · Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism…

Machine Learning · Computer Science · 2021-09-29 · Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline

Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-24 · Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-29 · Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science · 2020-11-10 · Lei Guan, Wotao Yin, Dongsheng Li, Xicheng Lu

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these…

Machine Learning · Computer Science · 2024-10-28 · Houming Wu, Ling Chen, Wenjie Yu

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-23 · Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-15 · Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin

DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines

Diffusion models have emerged as dominant performers for image generation. To support training large diffusion models, this paper studies pipeline parallel training of diffusion models and proposes DiffusionPipe, a synchronous pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-03 · Ye Tian, Zhen Jia, Ziyue Luo, Yida Wang, Chuan Wu

A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training

Pipeline parallelism is an essential distributed parallelism method. Increasingly complex and diverse DNN models necessitate meticulously customized pipeline schedules for performance. However, existing practices typically rely on…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-10 · Lijuan Jiang, Xingjian Qian, Zhenxiang Ma, Zan Zong, Hengjie Li, Chao Yang, Jidong Zhai

DawnPiper: A Memory-scalable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-12 · Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

Long context training is crucial for LLM's context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-28 · Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel…

Machine Learning · Computer Science · 2025-07-02 · Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science · 2025-04-22 · Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-05-29 · Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, Young-ri Choi

CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training

Training large language models (LLMs) now requires resources that exceed a single datacenter, making cross-datacenter strategies increasingly crucial. We present CrossPipe, a framework designed to optimize model training across…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-02 · Tiancheng Chen, Ales Kubicek, Langwen Huang, Torsten Hoefler

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-12-30 · Nanda K. Unnikrishnan, Keshab K. Parhi

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-12 · Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-09 · Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Xinyu Liu, Yongxiang Feng, Zhen Chen, Ke Zhang

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-09-21 · Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang