Related papers: Memory-Efficient Pipeline-Parallel DNN Training

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline parallel computing model avoids the slowdowns faced by data-parallel training when…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-12 Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons

DawnPiper: A Memory-scalable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

PipeMare: Asynchronous Pipeline Parallel DNN Training

Pipeline parallelism (PP) when training neural networks enables larger models to be partitioned spatially, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve the statistical…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-11 Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R. Aberger, Christopher De Sa

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

2BP: 2-Stage Backpropagation

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used…

Machine Learning · Computer Science 2024-05-29 Christopher Rae, Joseph K. L. Lee, James Richings

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-15 Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these…

Machine Learning · Computer Science 2024-10-28 Houming Wu, Ling Chen, Wenjie Yu

TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness

DNN training is time-consuming and requires efficient multi-accelerator parallelization, where a single training iteration is split over available accelerators. Current approaches often parallelize training using intra-batch…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-24 Ankita Dutta, Nabendu Chaki, Rajat K. De

AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

It is usually infeasible to fit and train an entire large deep neural network (DNN) model using a single edge device due to the limited resources. To facilitate intelligent applications across edge devices, researchers have proposed…

Machine Learning · Computer Science 2023-11-13 Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Zhiguo Shi, Jiming Chen

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Shigang Li, Torsten Hoefler

Zero Bubble Pipeline Parallelism

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-22 Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

Pipeline Parallelism for Inference on Heterogeneous Edge Computing

Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-29 Yang Hu, Connor Imes, Xuanang Zhao, Souvik Kundu, Peter A. Beerel, Stephen P. Crago, John Paul N. Walters

Pipelined Backpropagation at Scale: Training Large Models without Batches

New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of…

Machine Learning · Computer Science 2021-04-13 Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, Urs Köster

OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training

Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency

Large-scale language models have become increasingly challenging and expensive to train. Among various methods addressing this issue, Pipeline Parallelism has been widely employed to accommodate massive model weights within limited GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Ziming Liu, Shenggan Cheng, Haotian Zhou, Yang You

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-10 Joel Lamy-Poirier

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science 2020-11-10 Lei Guan, Wotao Yin, Dongsheng Li, Xicheng Lu

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia