Related papers: DeepCompile: A Compiler-Driven Approach to Optimiz…

dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-19 Hanpeng Hu , Chenyu Jiang , Yuchen Zhong , Yanghua Peng , Chuan Wu , Yibo Zhu , Haibin Lin , Chuanxiong Guo

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-19 Youhe Jiang , Fangcheng Fu , Xupeng Miao , Xiaonan Nie , Bin Cui

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-30 Nanda K. Unnikrishnan , Keshab K. Parhi

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-22 Youhe Jiang , Fangcheng Fu , Xupeng Miao , Xiaonan Nie , Bin Cui

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover Manner

Distributed deep learning workloads include throughput-intensive training tasks on the GPU clusters, where the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forces workers…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-16 Cheng Luo , Lei Qu , Youshan Miao , Peng Cheng , Yongqiang Xiong

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee , Jihwan Oh , Junkyum Kim , Seokjin Go , Jongse Park , Divya Mahajan

Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-06 Md Sultanul Islam Ovi

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and datasets. However, system scalability is limited by…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Zhenheng Tang , Shaohuai Shi , Wei Wang , Bo Li , Xiaowen Chu

Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion

This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Xiaodong Yi , Shiwei Zhang , Lansong Diao , Chuan Wu , Zhen Zheng , Shiqing Fan , Siyu Wang , Jun Yang , Wei Lin

DawnPiper: A Memory-scablable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng , Xuanhua Shi , Haolin Zhang , Yunfei Zhao , Xuehai Qian

From Profiling to Optimization: Unveiling the Profile Guided Optimization

Profile Guided Optimization (PGO) uses runtime profiling to direct compiler optimization decisions, effectively combining static analysis with actual execution behavior to enhance performance. Runtime profiles, collected through…

Performance · Computer Science 2025-07-23 Bingxin Liu , Yinghui Huang , Jianhua Gao , Jianjun Shi , Yongpeng Liu , Yipin Sun , Weixing Ji

Communication optimization strategies for distributed deep neural network training: A survey

Recent trends in high-performance computing and deep learning have led to the proliferation of studies on large-scale deep neural network training. However, the frequent communication requirements among computation nodes drastically slows…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-27 Shuo Ouyang , Dezun Dong , Yemao Xu , Liquan Xiao

SplitBrain: Hybrid Data and Model Parallel Deep Learning

The recent success of deep learning applications has coincided with those widely available powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as…

Machine Learning · Computer Science 2022-01-03 Farley Lai , Asim Kadav , Erik Kruus

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-10 Feng Liang , Zhen Zhang , Haifeng Lu , Victor C. M. Leung , Yanyi Guo , Xiping Hu

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-30 Yunze Wei , Tianshuo Hu , Cong Liang , Yong Cui

COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are…

Hardware Architecture · Computer Science 2025-01-14 Jihoon Park , Jeongin Choe , Dohyun Kim , Jae-Joon Kim

Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations

The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task…

Machine Learning · Computer Science 2026-03-31 Mayank Jha

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL…

Machine Learning · Computer Science 2023-12-27 Hongzheng Chen , Cody Hao Yu , Shuai Zheng , Zhen Zhang , Zhiru Zhang , Yida Wang