Related papers: Nimble: Lightweight and Parallel GPU Task Scheduli…

Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference

Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a…

Programming Languages · Computer Science 2021-03-15 Haichen Shen , Jared Roesch , Zhi Chen , Wei Chen , Yong Wu , Mu Li , Vin Sharma , Zachary Tatlock , Yida Wang

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-30 Fuxun Yu , Shawn Bray , Di Wang , Longfei Shangguan , Xulong Tang , Chenchen Liu , Xiang Chen

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

Comparative Analysis of CPU and GPU Profiling for Deep Learning Models

Deep Learning(DL) and Machine Learning(ML) applications are rapidly increasing in recent days. Massive amounts of data are being generated over the internet which can derive meaningful results by the use of ML and DL algorithms. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks

Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-31 Yidi Wang , Cong Liu , Daniel Wong , Hyoseung Kim

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned. As a result,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Behnam Pourghassemi , Chenghao Zhang , Joo Hwan Lee , Aparna Chandramowlishwaran

RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization

Many emerging cyber-physical systems, such as autonomous vehicles and robots, rely heavily on artificial intelligence and machine learning algorithms to perform important system operations. Since these highly parallel applications are…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-07 An Zou , Jing Li , Christopher D. Gill , Xuan Zhang

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Graphics processors, or GPUs, have recently been widely used as accelerators in the shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-03-22 Jianlong Zhong , Bingsheng He

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in a tremendously fast growth of energy consumption. It is important to reduce the energy consumption while completing the DL training jobs early in…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-16 Diandian Gu , Xintong Xie , Gang Huang , Xin Jin , Xuanzhe Liu

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-03 Beidi Chen , Tharun Medini , James Farwell , Sameh Gobriel , Charlie Tai , Anshumali Shrivastava

Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-12 Shruti Dongare , Redwan Ibne Seraj Khan , Hadeel Albahar , Nannan Zhao , Diego Melendez Maita , Ali R. Butt

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

Several methods exist today to accelerate Machine Learning(ML) or Deep-Learning(DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…

Machine Learning · Computer Science 2023-08-23 Srinjoy Das , Lawrence Rauchwerger

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr , Philip Salzmann , Peter Thoman , Thomas Fahringer

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao , Qinghao Hu , Zhisheng Ye , Peng Sun , Xiaolin Wang , Yingwei Luo , Tianwei Zhang , Yonggang Wen

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance for individual nodes of the computation graph, while ignoring the…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-26 Linpeng Tang , Yida Wang , Theodore L. Willke , Kai Li

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

Highly parallelized workloads like machine learning training, inferences and general HPC tasks are greatly accelerated using GPU devices. In a cloud computing cluster, serving a GPU's computation power through multi-tasks sharing is highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-05 Wenqing Wu

AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning

In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these…

Machine Learning · Computer Science 2023-05-16 Siddharth Singh , Abhinav Bhatele

DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training

Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to…

Machine Learning · Computer Science 2023-07-18 Hongkuan Zhou , Da Zheng , Xiang Song , George Karypis , Viktor Prasanna

GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma , Vivek M. Bhasi , Sonali Singh , George Kesidis , Mahmut T. Kandemir , Chita R. Das

Experiments on Parallel Training of Deep Neural Network using Model Averaging

In this work we apply model averaging to parallel training of deep neural network (DNN). Parallelization is done in a model averaging manner. Data is partitioned and distributed to different nodes for local model updates, and model…

Machine Learning · Computer Science 2018-07-03 Hang Su , Haoyu Chen