Related papers: Efficient and Robust Parallel DNN Training through…

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline parallel computing model avoids the slowdowns faced by data-parallel training when…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-12 Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons

Experiments on Parallel Training of Deep Neural Network using Model Averaging

In this work we apply model averaging to parallel training of deep neural network (DNN). Parallelization is done in a model averaging manner. Data is partitioned and distributed to different nodes for local model updates, and model…

Machine Learning · Computer Science 2018-07-03 Hang Su, Haoyu Chen

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science 2020-11-10 Lei Guan, Wotao Yin, Dongsheng Li, Xicheng Lu

Automatic Model Parallelism for Deep Neural Networks with Compiler and Hardware Support

Deep neural networks (DNNs) have been enormously successful in tasks that were hitherto in the human-only realm, such as image recognition and language translation. Owing to their success, DNNs are being explored for use in ever more…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-20 Sanket Tavarageri, Srinivas Sridharan, Bharat Kaul

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-29 Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

Communication is a key bottleneck for distributed graph neural network (GNN) training. This paper proposes GNNPipe, a new approach that scales the distributed full-graph deep GNN training. Being the first to use layer-level model…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-26 Jingji Chen, Zhuoming Chen, Xuehai Qian

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

Parareal Neural Networks Emulating a Parallel-in-time Algorithm

As deep neural networks (DNNs) become deeper, the training time increases. In this perspective, multi-GPU parallel computing has become a key tool in accelerating the training of DNNs. In this paper, we introduce a novel methodology to…

Numerical Analysis · Mathematics 2024-07-08 Chang-Ock Lee, Youngkyu Lee, Jongho Park

PaSE: Parallelization Strategies for Efficient DNN Training

Training a deep neural network (DNN) requires substantial computational and memory resources. It is common to use multiple devices to train a DNN to reduce the overall training time. There are several choices to parallelize each layer in…

Machine Learning · Computer Science 2024-07-08 Venmugil Elango

TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness

DNN training is time-consuming and requires efficient multi-accelerator parallelization, where a single training iteration is split over available accelerators. Current approaches often parallelize training using intra-batch…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-24 Ankita Dutta, Nabendu Chaki, Rajat K. De

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, Young-ri Choi

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-15 Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin

Hybrid Data-Model Parallel Training for Sequence-to-Sequence Recurrent Neural Network Machine Translation

Reduction of training time is an important issue in many tasks like patent translation involving neural networks. Data parallelism and model parallelism are two common approaches for reducing training time using multiple graphics processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-10 Junya Ono, Masao Utiyama, Eiichiro Sumita

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned. As a result,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Behnam Pourghassemi, Chenghao Zhang, Joo Hwan Lee, Aparna Chandramowlishwaran

DawnPiper: A Memory-scalable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning

Currently, training large-scale deep learning models is typically achieved through parallel training across multiple GPUs. However, due to the inherent communication overhead and synchronization delays in traditional model parallelism…

Computer Vision and Pattern Recognition · Computer Science 2024-11-21 Xiuyuan Guo, Chengqi Xu, Guinan Guo, Feiyu Zhu, Changpeng Cai, Peizhe Wang, Xiaoming Wei, Junhao Su, Jialin Gao

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt