Related papers: DiffusionPipe: Training Large Diffusion Models wit…

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training

With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these…

Machine Learning · Computer Science 2024-10-28 Houming Wu , Ling Chen , Wenjie Yu

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-15 Letian Zhao , Rui Xu , Tianqi Wang , Teng Tian , Xiaotian Wang , Wei Wu , Chio-in Ieong , Xi Jin

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo , Xiaodong Yi , Guoping Long , Shiqing Fan , Chuan Wu , Jun Yang , Wei Lin

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science 2020-11-10 Lei Guan , Wotao Yin , Dongsheng Li , Xicheng Lu

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism…

Machine Learning · Computer Science 2021-09-29 Zhuohan Li , Siyuan Zhuang , Shiyuan Guo , Danyang Zhuo , Hao Zhang , Dawn Song , Ion Stoica

DawnPiper: A Memory-scablable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng , Xuanhua Shi , Haolin Zhang , Yunfei Zhao , Xuehai Qian

Pipeline Parallelism for Inference on Heterogeneous Edge Computing

Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-29 Yang Hu , Connor Imes , Xuanang Zhao , Souvik Kundu , Peter A. Beerel , Stephen P. Crago , John Paul N. Walters

DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines

Multi-task model training has been adopted to enable a single deep neural network model (often a large language model) to handle multiple tasks (e.g., question answering and text summarization). Multi-task training commonly receives input…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-20 Chenyu Jiang , Zhen Jia , Shuai Zheng , Yida Wang , Chuan Wu

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called, which is based on a closer look at time steps. Our key findings are: i) Time steps…

Machine Learning · Computer Science 2025-03-26 Kai Wang , Mingjia Shi , Yukun Zhou , Zekai Li , Zhihang Yuan , Yuzhang Shang , Xiaojiang Peng , Hanwang Zhang , Yang You

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single…

Computer Vision and Pattern Recognition · Computer Science 2019-07-29 Yanping Huang , Youlong Cheng , Ankur Bapna , Orhan Firat , Mia Xu Chen , Dehao Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V. Le , Yonghui Wu , Zhifeng Chen

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-09 Zhida Jiang , Zhaolong Xing , Huichao Chai , Tianxing Sun , Qiang Peng , Baopeng Yuan , Jiaxing Wang , Hua Du , Zhixin Wu , Xuemiao Li , Yikui Cao , Xinyu Liu , Yongxiang Feng , Zhen Chen , Ke Zhang

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Jay H. Park , Gyeongchan Yun , Chang M. Yi , Nguyen T. Nguyen , Seungmin Lee , Jaesik Choi , Sam H. Noh , Young-ri Choi

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which…

Computer Vision and Pattern Recognition · Computer Science 2023-10-20 Zhendong Wang , Yifan Jiang , Huangjie Zheng , Peihao Wang , Pengcheng He , Zhangyang Wang , Weizhu Chen , Mingyuan Zhou

A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training

Pipeline parallelism is an essential distributed parallelism method. Increasingly complex and diverse DNN models necessitate meticulously customized pipeline schedules for performance. However, existing practices typically rely on…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-10 Lijuan Jiang , Xingjian Qian , Zhenxiang Ma , Zan Zong , Hengjie Li , Chao Yang , Jidong Zhai

Zero Bubble Pipeline Parallelism

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-22 Penghui Qi , Xinyi Wan , Guangxing Huang , Min Lin

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Jiarui Fang , Jinzhe Pan , Aoyu Li , Xibo Sun , Jiannan Wang

Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models

Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong…

Machine Learning · Computer Science 2025-03-04 Xingzhuo Guo , Yu Zhang , Baixu Chen , Haoran Xu , Jianmin Wang , Mingsheng Long

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

PipeDream is a Deep Neural Network(DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline parallel computing model avoids the slowdowns faced by data-parallel training when…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-12 Aaron Harlap , Deepak Narayanan , Amar Phanishayee , Vivek Seshadri , Nikhil Devanur , Greg Ganger , Phil Gibbons

ProtoDiffusion: Classifier-Free Diffusion Guidance with Prototype Learning

Diffusion models are generative models that have shown significant advantages compared to other generative models in terms of higher generation quality and more stable training. However, the computational need for training diffusion models…

Computer Vision and Pattern Recognition · Computer Science 2023-07-06 Gulcin Baykal , Halil Faruk Karagoz , Taha Binhuraib , Gozde Unal