Related papers: GPU Cluster Scheduling for Network-Sensitive Deep …

200 papers

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-06-02 · Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in a tremendously fast growth of energy consumption. It is important to reduce the energy consumption while completing the DL training jobs early in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-16 · Diandian Gu, Xintong Xie, Gang Huang, Xin Jin, Xuanzhe Liu

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-09-07 · Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang

Distributed Deep Learning (DDL) has rapidly grown its popularity since it helps boost the training performance on high-performance GPU clusters. Efficient job scheduling is indispensable to maximize the overall performance of the cluster…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-25 · Qiang Wang, Shaohuai Shi, Canhui Wang, Xiaowen Chu

The recent explosive growth of deep learning (DL) models has necessitated a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-13 · Ziyue Luo, Jia Liu, Myungjin Lee, Ness B. Shroff

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-19 · Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-25 · Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can…

Machine Learning · Computer Science · 2023-11-01 · Junyeol Ryu, Jeongyoon Eo

More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL…

Machine Learning · Computer Science · 2019-09-16 · Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-12-28 · Xiaoyang Zhao, Chuan Wu

Distributed deep learning workloads include throughput-intensive training tasks on the GPU clusters, where the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forces workers…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-16 · Cheng Luo, Lei Qu, Youshan Miao, Peng Cheng, Yongqiang Xiong

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-08-09 · Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-12 · Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita, Ali R. Butt

Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-08-24 · Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, Matei Zaharia

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as the computing power grows much faster than network capacity, network communication has gradually…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-08-11 · Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao

Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs, presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-17 · Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-05-27 · Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-21 · Sankalpa Timilsina, Susmit Shannigrahi

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottleneck and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-02-03 · Menglu Yu, Ye Tian, Bo Ji, Chuan Wu, Hridesh Rajan, Jia Liu

We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries: a computing style often seen in applications that interact with users in support of image processing and natural…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-29 · Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg