Related papers: Characterizing and Modeling Distributed Training w…

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li , Robert J. Walls , Lijie Xu , Tian Guo

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-21 Shaohuai Shi , Xianhao Zhou , Shutao Song , Xingyao Wang , Zilin Zhu , Xue Huang , Xinan Jiang , Feihu Zhou , Zhenyu Guo , Liqiang Xie , Rui Lan , Xianbin Ouyang , Yan Zhang , Jieqian Wei , Jing Gong , Weiliang Lin , Ping Gao , Peng Meng , Xiaomin Xu , Chenyang Guo , Bo Yang , Zhibo Chen , Yongjian Wu , Xiaowen Chu

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-17 Davit Buniatyan

Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency

In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications. This paper presents a comprehensive study of…

Machine Learning · Computer Science 2023-04-28 Neelesh Mungoli

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks

Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One caveat however is the substantial amount of compute needed to train these deep learning models. Training a benchmark dataset like ImageNet on a…

Machine Learning · Computer Science 2018-10-30 Karanbir Chahal , Manraj Singh Grover , Kuntal Dey

DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and…

Systems and Control · Electrical Eng. & Systems 2025-05-28 Jinquan Wang , Xiaojian Liao , Xuzhao Liu , Jiashun Suo , Zhisheng Huo , Chenhao Zhang , Xiangrong Xu , Runnan Shen , Xilong Xie , Limin Xiao

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi , Qiang Wang , Xiaowen Chu

A Study on the Performance of Distributed Training of Data-driven CFD Simulations

Data-driven methods for computer simulations are blooming in many scientific areas. The traditional approach to simulating physical behaviors relies on solving partial differential equations (PDE). Since calculating these iterative…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-01 Sergio Iserte , Alejandro González-Barberá , Paloma Barreda , Krzysztof Rojek

Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training

Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based…

Machine Learning · Computer Science 2019-01-14 Youjie Li , Mingchao Yu , Songze Li , Salman Avestimehr , Nam Sung Kim , Alexander Schwing

Characterizing and Understanding Distributed GNN Training on GPUs

Graph neural network (GNN) has been demonstrated to be a powerful model in many domains for its effectiveness in learning over graphs. To scale GNN training for large graphs, a widely adopted approach is distributed training which…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-19 Haiyang Lin , Mingyu Yan , Xiaocheng Yang , Mo Zou , Wenming Li , Xiaochun Ye , Dongrui Fan

Theano-MPI: a Theano-based Distributed Training Framework

We develop a scalable and extendable training framework that can utilize GPUs across nodes in a cluster and accelerate the training of deep learning models based on data parallelism. Both synchronous and asynchronous training are…

Machine Learning · Computer Science 2016-05-27 He Ma , Fei Mao , Graham W. Taylor

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin , Ning Sun , Pallab Bhattacharya , Xizhou Feng , Louis Feng , John D. Owens

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage

Cloud computing provides a powerful yet low-cost environment for distributed deep learning workloads. However, training complex deep learning models often requires accessing large amounts of data, which can easily exceed the capacity of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-24 Nicholas Krichevsky , Renee St Louis , Tian Guo

Scalable Balanced Training of Conditional Generative Adversarial Neural Networks on Image Data

We propose a distributed approach to train deep convolutional generative adversarial neural network (DC-CGANs) models. Our method reduces the imbalance between generator and discriminator by partitioning the training data according to data…

Computer Vision and Pattern Recognition · Computer Science 2021-04-30 Massimiliano Lupo Pasini , Vittorio Gabbi , Junqi Yin , Simona Perotto , Nouamane Laanait

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models

Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-15 Runsheng Benson Guo , Utkarsh Anand , Arthur Chen , Khuzaima Daudjee

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, the existing systems are not…

Machine Learning · Computer Science 2024-04-16 Youshao Xiao , Shangchun Zhao , Zhenglei Zhou , Zhaoxin Huan , Lin Ju , Xiaolu Zhang , Lin Wang , Jun Zhou

Decentralized Diffusion Models

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 David McAllister , Matthew Tancik , Jiaming Song , Angjoo Kanazawa

Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution…

Machine Learning · Computer Science 2019-12-03 Alexey Svyatkovskiy , Julian Kates-Harbeck , William Tang

Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin-up large clusters for training ML models, it can also lead to ballooning costs. The 100s of virtual machine sizes provided by cloud platforms also makes it…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-06 Sahil Tyagi , Prateek Sharma

Machine Learning on Volatile Instances

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…

Machine Learning · Computer Science 2020-03-13 Xiaoxi Zhang , Jianyu Wang , Gauri Joshi , Carlee Joe-Wong