Related papers: GPU Cluster Scheduling for Network-Sensitive Deep …

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in a tremendously fast growth of energy consumption. It is important to reduce the energy consumption while completing the DL training jobs early in…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-16 Diandian Gu, Xintong Xie, Gang Huang, Xin Jin, Xuanzhe Liu

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-07 Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

Distributed Deep Learning (DDL) has rapidly grown its popularity since it helps boost the training performance on high-performance GPU clusters. Efficient job scheduling is indispensable to maximize the overall performance of the cluster…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-25 Qiang Wang, Shaohuai Shi, Canhui Wang, Xiaowen Chu

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters

The recent explosive growth of deep learning (DL) models has necessitated a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-13 Ziyue Luo, Jia Liu, Myungjin Lee, Ness B. Shroff

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-19 Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-25 Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

Network Contention-Aware Cluster Scheduling with Reinforcement Learning

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can…

Machine Learning · Computer Science 2023-11-01 Junyeol Ryu, Jeongyoon Eo

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL…

Machine Learning · Computer Science 2019-09-16 Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-28 Xiaoyang Zhao, Chuan Wu

CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover Manner

Distributed deep learning workloads include throughput-intensive training tasks on the GPU clusters, where the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forces workers…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-16 Cheng Luo, Lei Qu, Youshan Miao, Peng Cheng, Yongqiang Xiong

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-09 Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-12 Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita, Ali R. Butt

Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, Matei Zaharia

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as the computing power grows much faster than network capacity, network communication has gradually…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-11 Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao

Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs, presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-27 Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottleneck and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-03 Menglu Yu, Ye Tian, Bo Ji, Chuan Wu, Hridesh Rajan, Jia Liu

Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows

We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries: a computing style often seen in applications that interact with users in support of image processing and natural…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-29 Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg