Related papers: Energy-Efficient GPU Clusters Scheduling for Deep …

200 papers

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-07 Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent…

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-25 Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer…

Performance · Computer Science 2019-05-28 Zhenheng Tang, Yuxin Wang, Qiang Wang, Xiaowen Chu

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-25 Seo Jin Park, Joshua Fried, Sunghyun Kim, Mohammad Alizadeh, Adam Belay

Energy conservation of large data centers for high-performance computing workloads, such as deep learning with big data, is of critical significance, where cutting down a few percent of electricity translates into million-dollar savings.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-02 Xinxin Mei, Qiang Wang, Xiaowen Chu, Hai Liu, Yiu-Wing Leung, Zongpeng Li

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-19 Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Deep Learning Training (DLT) is a growing workload in shared GPU/CPU clusters due to its high computational cost and increasing number of jobs. This contributes to significant energy consumption in GPU clusters, further exacerbated by GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-12 Kawsar Haghshenas, Mona Hashemi

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can…

Machine Learning · Computer Science 2023-11-01 Junyeol Ryu, Jeongyoon Eo

Energy consumption is one of the most critical concerns in designing computing devices, ranging from portable embedded systems to computer cluster systems. Furthermore, in the past decade, cluster systems have increasingly risen as popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-12 Amirhossein Esmaili, Massoud Pedram

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-09 Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-06 Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang

Graph Neural Networks (GNNs) play a crucial role in various fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams. In contrast, many real-world graphs are…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-01 Yuchen Zhong, Guangming Sheng, Tianzuo Qin, Minjie Wang, Quan Gan, Chuan Wu

Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, most DL clusters either dedicate each GPU to one workload or share workloads in time,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-27 Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, Xin Jin

Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of…

Machine Learning · Computer Science 2020-12-07 Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

Deep Learning (DL) and Machine Learning (ML) applications have been increasing rapidly in recent years. Massive amounts of data are being generated over the internet, from which ML and DL algorithms can derive meaningful results. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

The power consumption of enormous network devices in data centers has emerged as a big concern to data center operators. Despite many traffic-engineering-based solutions, very little attention has been paid to performance-guaranteed energy…

Networking and Internet Architecture · Computer Science 2014-05-30 Lin Wang, Fa Zhang, Kai Zheng, Athanasios V. Vasilakos, Shaolei Ren, Zhiyong Liu