Related papers: Energy-Efficient GPU Clusters Scheduling for Deep …

GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-07 Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent…

Machine Learning · Computer Science 2022-02-01 Nathan C. Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, Vijay Gadepally, Siddharth Samsi

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-25 Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study

Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer…

Performance · Computer Science 2019-05-28 Zhenheng Tang, Yuxin Wang, Qiang Wang, Xiaowen Chu

Efficient Strong Scaling Through Burst Parallel Training

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-25 Seo Jin Park, Joshua Fried, Sunghyun Kim, Mohammad Alizadeh, Adam Belay

Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters

Energy conservation of large data centers for high-performance computing workloads, such as deep learning with big data, is of critical significance, where cutting down a few percent of electricity translates into million-dollar savings.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-02 Xinxin Mei, Qiang Wang, Xiaowen Chu, Hai Liu, Yiu-Wing Leung, Zongpeng Li

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-19 Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training

Deep Learning Training (DLT) is a growing workload in shared GPU/CPU clusters due to its high computational cost and increasing number of jobs. This contributes to significant energy consumption in GPU clusters, further exacerbated by GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-12 Kawsar Haghshenas, Mona Hashemi

Network Contention-Aware Cluster Scheduling with Reinforcement Learning

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can…

Machine Learning · Computer Science 2023-11-01 Junyeol Ryu, Jeongyoon Eo

Energy-aware Scheduling of Jobs in Heterogeneous Cluster Systems Using Deep Reinforcement Learning

Energy consumption is one of the most critical concerns in designing computing devices, ranging from portable embedded systems to computer cluster systems. Furthermore, in the past decade, cluster systems have increasingly risen as popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-12 Amirhossein Esmaili, Massoud Pedram

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-09 Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

Dynamic GPU Energy Optimization for Machine Learning Training Workloads

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-06 Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang

GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs

Graph Neural Networks (GNNs) play a crucial role in various fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams. In contrast, many real-world graphs are…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-01 Yuchen Zhong, Guangming Sheng, Tianzuo Qin, Minjie Wang, Quan Gan, Chuan Wu

MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters

Large-scale GPU clusters are widely-used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, most DL clusters either dedicate each GPU to one workload or share workloads in time,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-27 Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, Xin Jin

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of…

Machine Learning · Computer Science 2020-12-07 Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

Comparative Analysis of CPU and GPU Profiling for Deep Learning Models

Deep Learning (DL) and Machine Learning (ML) applications are rapidly increasing in recent days. Massive amounts of data are being generated over the internet which can derive meaningful results by the use of ML and DL algorithms. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

Energy-Efficient Flow Scheduling and Routing with Hard Deadlines in Data Center Networks

The power consumption of enormous network devices in data centers has emerged as a big concern to data center operators. Despite many traffic-engineering-based solutions, very little attention has been paid to performance-guaranteed energy…

Networking and Internet Architecture · Computer Science 2014-05-30 Lin Wang, Fa Zhang, Kai Zheng, Athanasios V. Vasilakos, Shaolei Ren, Zhiyong Liu