English
Related papers

Related papers: DeepCompile: A Compiler-Driven Approach to Optimiz…

200 papers

Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-19 Hanpeng Hu , Chenyu Jiang , Yuchen Zhong , Yanghua Peng , Chuan Wu , Yibo Zhu , Haibin Lin , Chuanxiong Guo

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-19 Youhe Jiang , Fangcheng Fu , Xupeng Miao , Xiaonan Nie , Bin Cui

The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-30 Nanda K. Unnikrishnan , Keshab K. Parhi

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-22 Youhe Jiang , Fangcheng Fu , Xupeng Miao , Xiaonan Nie , Bin Cui

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

Distributed deep learning workloads include throughput-intensive training tasks on the GPU clusters, where the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forces workers…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-16 Cheng Luo , Lei Qu , Youshan Miao , Peng Cheng , Yongqiang Xiong

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee , Jihwan Oh , Junkyum Kim , Seokjin Go , Jongse Park , Divya Mahajan

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-06 Md Sultanul Islam Ovi

Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and datasets. However, system scalability is limited by…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Zhenheng Tang , Shaohuai Shi , Wei Wang , Bo Li , Xiaowen Chu

This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Xiaodong Yi , Shiwei Zhang , Lansong Diao , Chuan Wu , Zhen Zheng , Shiqing Fan , Siyu Wang , Jun Yang , Wei Lin

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng , Xuanhua Shi , Haolin Zhang , Yunfei Zhao , Xuehai Qian

Profile Guided Optimization (PGO) uses runtime profiling to direct compiler optimization decisions, effectively combining static analysis with actual execution behavior to enhance performance. Runtime profiles, collected through…

Performance · Computer Science 2025-07-23 Bingxin Liu , Yinghui Huang , Jianhua Gao , Jianjun Shi , Yongpeng Liu , Yipin Sun , Weixing Ji

Recent trends in high-performance computing and deep learning have led to the proliferation of studies on large-scale deep neural network training. However, the frequent communication requirements among computation nodes drastically slows…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-27 Shuo Ouyang , Dezun Dong , Yemao Xu , Liquan Xiao

The recent success of deep learning applications has coincided with those widely available powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as…

Machine Learning · Computer Science 2022-01-03 Farley Lai , Asim Kadav , Erik Kruus

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-10 Feng Liang , Zhen Zhang , Haifeng Lu , Victor C. M. Leung , Yanyi Guo , Xiping Hu

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-30 Yunze Wei , Tianshuo Hu , Cong Liang , Yong Cui

Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are…

Hardware Architecture · Computer Science 2025-01-14 Jihoon Park , Jeongin Choe , Dohyun Kim , Jae-Joon Kim

The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task…

Machine Learning · Computer Science 2026-03-31 Mayank Jha

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL…

Machine Learning · Computer Science 2023-12-27 Hongzheng Chen , Cody Hao Yu , Shuai Zheng , Zhen Zhang , Zhiru Zhang , Yida Wang
‹ Prev 1 2 3 10 Next ›