Related papers: Distributed Deep Learning Using Volunteer Computin…

Distributed Deep Learning in Open Collaborations

Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and…

Machine Learning · Computer Science 2021-11-09 Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko

Machine Learning on Volatile Instances

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…

Machine Learning · Computer Science 2020-03-13 Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong

DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-15 Yoochan Kim, Kihyun Kim, Yonghyeon Cho, Jinwoo Kim, Awais Khan, Ki-Dong Kang, Baik-Song An, Myung-Hoon Cha, Hong-Yeon Kim, Youngjae Kim

Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin-up large clusters for training ML models, it can also lead to ballooning costs. The 100s of virtual machine sizes provided by cloud platforms also makes it…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-06 Sahil Tyagi, Prateek Sharma

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and datasets. However, system scalability is limited by…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu

Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View

Deep learning has become an indispensable part of life, such as face recognition, NLP, etc., but the training of deep model has always been a challenge, and in recent years, the complexity of training data and models has shown explosive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-18 Sheng Huang

Differentiable Visual Computing for Inverse Problems and Machine Learning

Originally designed for applications in computer graphics, visual computing (VC) methods synthesize information about physical and virtual worlds, using prescribed algorithms optimized for spatial computing. VC is used to analyze geometry,…

Machine Learning · Computer Science 2023-12-11 Andrew Spielberg, Fangcheng Zhong, Konstantinos Rematas, Krishna Murthy Jatavallabhula, Cengiz Oztireli, Tzu-Mao Li, Derek Nowrouzezahrai

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

We design and implement a distributed multinode synchronous SGD algorithm, without altering hyper parameters, or compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-02 Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

VELTAIR: Towards High-Performance Multi-tenant Deep Learning Services via Adaptive Compilation and Scheduling

Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although the multi-tenant…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-19 Zihan Liu, Jingwen Leng, Zhihui Zhang, Quan Chen, Chao Li, Minyi Guo

AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Jinfan Chen, Shigang Li, Ran Gun, Jinhui Yuan, Torsten Hoefler

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization (to amortize their steep cost) is a challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-15 Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-21 Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, Xiaowen Chu

Distributed Training of Deep Learning Models: A Taxonomic Perspective

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads in their chosen…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-09 Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue

Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-26 Ruben Mayer, Hans-Arno Jacobsen

ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource demanding and they experience great benefits when exploiting AI…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-12 Federica Filippini, Danilo Ardagna, Marco Lattuada, Edoardo Amaldi, Michele Ciavotta, Maciek Riedl, Katarzyna Materka, Paweł Skrzypek, Fabrizio Magugliani, Marco Cicala

Distributed and Deep Vertical Federated Learning with Big Data

In recent years, data are typically distributed in multiple organizations while the data security is becoming increasingly important. Federated Learning (FL), which enables multiple parties to collaboratively train a model without…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-13 Ji Liu, Xuehai Zhou, Lei Mo, Shilei Ji, Yuan Liao, Zheng Li, Qin Gu, Dejing Dou

A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings

In the era of deep learning (DL), convolutional neural networks (CNNs), and large language models (LLMs), machine learning (ML) models are becoming increasingly complex, demanding significant computational resources for both inference and…

Machine Learning · Computer Science 2024-05-27 Madison Threadgill, Andreas Gerstlauer

Distributed Deep Learning using Stochastic Gradient Staleness

Despite the notable success of deep neural networks (DNNs) in solving complex tasks, the training process still remains considerable challenges. A primary obstacle is the substantial time required for training, particularly as high…

Machine Learning · Computer Science 2025-09-09 Viet Hoang Pham, Hyo-Sung Ahn

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as the computing power grows much faster than network capacity, network communication has gradually…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-11 Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao