Related papers: Dynamic Space-Time Scheduling for GPU Inference

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of GPU enables multiplexing…

Neural and Evolutionary Computing · Computer Science 2020-08-11 Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni, K. K. Ramakrishnan, Puneet Sharma

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-27 Aditya Dhakal, Sameer G. Kulkarni, K. K. Ramakrishnan

Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-10 Bowen Pang, Kai Li, Feifan Wang

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-30 Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, Xiang Chen

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-06 Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh

A Survey of Multi-Tenant Deep Learning Inference on GPU

Deep Learning (DL) models have achieved superior performance. Meanwhile, computing hardware like NVIDIA GPUs also demonstrated strong computing scaling trends with 2x throughput and memory bandwidth for each generation. With such strong…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-26 Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, Xiang Chen

Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing…

Performance · Computer Science 2021-01-13 Yoshiaki Inoue

Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs

Discrete optimization is a central problem in artificial intelligence. The optimization of the aggregated cost of a network of cost functions arises in a variety of problems including (W)CSP, DCOP, as well as optimization in stochastic…

Artificial Intelligence · Computer Science 2018-01-12 Ferdinando Fioretto, Enrico Pontelli, William Yeoh, Rina Dechter

FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference

Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference due to the ease of deployment and pay-per-use benefits. However, existing FaaS platforms utilize GPUs in a coarse manner for DL inferences, without…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt

ML Inference Scheduling with Predictable Latency

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…

Machine Learning · Computer Science 2025-12-25 Haidong Zhao, Nikolaos Georgantas

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments

In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In-Geol Chun, Dimitrios Nikolopoulos, Cheol-Ho Hong

GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science 2025-11-11 Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing

Modern LLM serving systems confront inefficient GPU utilization due to the fundamental mismatch between compute-intensive prefill and memory-bound decode phases. While current practices attempt to address this by organizing these phases…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-29 Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, Xianwei Zhang

Optimizing Performance of Recurrent Neural Networks on GPUs

As recurrent neural networks become larger and deeper, training times for single networks are rising into weeks or even months. As such there is a significant incentive to improve the performance and scalability of these networks. While…

Machine Learning · Computer Science 2016-04-08 Jeremy Appleyard, Tomas Kocisky, Phil Blunsom

DARIS: An Oversubscribed Spatio-Temporal Scheduler for Real-Time DNN Inference on GPUs

The widespread use of Deep Neural Networks (DNNs) is limited by high computational demands, especially in constrained environments. GPUs, though effective accelerators, often face underutilization and rely on coarse-grained scheduling. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Amir Fakhim Babaei, Thidapat Chantem

Efficient Strong Scaling Through Burst Parallel Training

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement for achieving acceptable training times. In this paper, we consider the case where future…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-25 Seo Jin Park, Joshua Fried, Sunghyun Kim, Mohammad Alizadeh, Adam Belay

Understanding GPU Resource Interference One Level Deeper

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

A GPU-Accelerated Distributed Algorithm for Optimal Power Flow in Distribution Systems

We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-15 Minseok Ryu, Geunyeong Byeon, Kibaek Kim

Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

Deployment of real-time ML services on warehouse-scale infrastructures is on the increase. Therefore, decreasing latency and increasing throughput of deep neural network (DNN) inference applications that empower those services have…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-29 Seyed Morteza Nabavinejad, Masoumeh Ebrahimi, Sherief Reda

Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers

Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-06 Thomas Bouvier, Bogdan Nicolae, Hugo Chaugier, Alexandru Costan, Ian Foster, Gabriel Antoniu