Related papers: Understanding GPU Resource Interference One Level …

Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads

In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-24 Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-20 Chao Chen, Chris Porter, Santosh Pande

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Intra-node Memory Safe GPU Co-Scheduling

GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-12-14 Carlos Reano, Federico Silla, Dimitrios S. Nikolopoulos, Blesson Varghese

Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)

There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and…

Programming Languages · Computer Science 2017-07-10 Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson

ML Inference Scheduling with Predictable Latency

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…

Machine Learning · Computer Science 2025-12-25 Haidong Zhao, Nikolaos Georgantas

Optimizing GPU Cache Policies for MI Workloads

In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly…

Hardware Architecture · Computer Science 2019-10-02 Johnathan Alsop, Matthew D. Sinclair, Srikant Bharadwaj, Alexandru Dutu, Anthony Gutierrez, Onur Kayiran, Michael LeBeane, Sooraj Puthoor, Xianwei Zhang, Tsung Tai Yeh, Bradford M. Beckmann

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-28 Xiaoyang Zhao, Chuan Wu

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-11 Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving…

Hardware Architecture · Computer Science 2026-05-26 Euijun Chung, Yuxiao Jia, Aaron Jezghani, Hyesoon Kim

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs

Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-06 Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-04 Fei Xu, Jianian Xu, Jiabin Chen, Li Chen, Ruitao Shang, Zhi Zhou, Fangming Liu

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Recent Advances in Overcoming Bottlenecks in Memory Systems and Managing Memory Resources in GPU Systems

This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics…

Hardware Architecture · Computer Science 2018-05-30 Onur Mutlu, Saugata Ghose, Rachata Ausavarungnirun

SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices

Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead…

Machine Learning · Computer Science 2025-05-22 Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang

Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution

Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Despite the potential concurrency between independent kernels, the order in which kernels are issued to the GPU will…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-26 Teng Li, Vikram K. Narayana, Tarek El-Ghazawi

Towards Efficient and Practical GPU Multitasking in the Era of LLM

GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet…

Operating Systems · Computer Science 2025-08-13 Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, Ion Stoica

An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs

Modern GPU workloads increasingly demand efficient resource sharing, as many jobs do not require the full capacity of a GPU. Among sharing techniques, NVIDIA's Multi-Instance GPU (MIG) offers strong resource isolation by enabling…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-19 Hsu-Tzu Ting, Jerry Chou, Ming-Hung Chen, I-Hsin Chung

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng