Related papers: Improving GPU Multi-Tenancy Through Dynamic Multi-…

An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs

Modern GPU workloads increasingly demand efficient resource sharing, as many jobs do not require the full capacity of a GPU. Among sharing techniques, NVIDIA's Multi-Instance GPU (MIG) offers strong resource isolation by enabling…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-19 · Hsu-Tzu Ting, Jerry Chou, Ming-Hung Chen, I-Hsin Chung

MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, the non-stationary nature of user data…

Machine Learning · Computer Science · 2025-10-07 · Yufei Li, Yu Fu, Yue Dong, Cong Liu

A Survey of Multi-Tenant Deep Learning Inference on GPU

Deep Learning (DL) models have achieved superior performance. Meanwhile, computing hardware like NVIDIA GPUs also demonstrated strong computing scaling trends with 2x throughput and memory bandwidth for each generation. With such strong…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-05-26 · Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, Xiang Chen

An Online Fragmentation-Aware GPU Scheduler for Multi-Tenant MIG-based Clouds

The explosive growth of AI applications has created unprecedented demand for GPU resources. Cloud providers meet this demand through GPU-as-a-Service platforms that offer rentable GPU resources for running AI workloads. In this context, the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-25 · Marco Zambianco, Lorenzo Fasol, Roberto Doriguzzi-Corin

An Analysis of Collocation on GPUs for Deep Learning Training

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit…

Machine Learning · Computer Science · 2023-04-25 · Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün

MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-10-10 · Baolin Li, Tirthak Patel, Siddarth Samsi, Vijay Gadepally, Devesh Tiwari

Improving Multi-Instance GPU Efficiency via Sub-Entry Sharing TLB Design

NVIDIA's Multi-Instance GPU (MIG) technology enables partitioning GPU computing power and memory into separate hardware instances, providing complete isolation including compute resources, caches, and memory. However, prior work identifies…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-04-30 · Bingyao Li, Yueqi Wang, Tianyu Wang, Lieven Eeckhout, Jun Yang, Aamer Jaleel, Xulong Tang

A Multi-Objective Framework for Optimizing GPU-Enabled VM Placement in Cloud Data Centers with Multi-Instance GPU Technology

The extensive use of GPUs in cloud computing and the growing need for multitenancy have driven the development of innovative solutions for efficient GPU resource management. Multi-Instance GPU (MIG) technology from NVIDIA enables shared GPU…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-05 · Ahmad Siavashi, Mahmoud Momtazpour

GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning

As deep learning continues to advance and is applied to increasingly complex scenarios, the demand for concurrent deployment of multiple neural network models has arisen. This demand, commonly referred to as multi-tenant computing, is…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-04-25 · Yongbo Yu, Fuxun Yu, Mingjia Zhang, Di Wang, Tolga Soyata, Chenchen Liu, Xiang Chen

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-03-01 · Yunseong Kim, Yujeong Choi, Minsoo Rhu

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-11 · Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-09-06 · Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-19 · Shaoyuan Huang, Yunfeng Zhao, Na Yan, Tiancheng Zhang, Xiaokai Wang, Xiaofei Wang, Wenyu Wang, Yansha Deng

Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections

Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-27 · Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, Peter Pietzuch

Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads

In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-05-24 · Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou

Continual Learners are Incremental Model Generalizers

Motivated by the efficiency and rapid convergence of pre-trained models for solving downstream tasks, this paper extensively studies the impact of Continual Learning (CL) models as pre-trainers. In both supervised and unsupervised CL, we…

Machine Learning · Computer Science · 2023-06-22 · Jaehong Yoon, Sung Ju Hwang, Yue Cao

Scaling On-Device GPU Inference for Large Generative Models

Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance,…

Machine Learning · Computer Science · 2025-05-02 · Jiuqiang Tang, Raman Sarokin, Ekaterina Ignasheva, Grant Jensen, Lin Chen, Juhyun Lee, Andrei Kulik, Matthias Grundmann

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-07-24 · Zinuo Cai, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Haibing Guan

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-04 · Fei Xu, Jianian Xu, Jiabin Chen, Li Chen, Ruitao Shang, Zhi Zhou, Fangming Liu

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation

Parallel accelerators, such as GPUs, are key enablers for large-scale Machine Learning (ML) applications. However, ML model developers often lack detailed knowledge of the underlying system architectures, while system programmers usually do…

Machine Learning · Computer Science · 2023-10-17 · Jhe-Yu Liou, Stephanie Forrest, Carole-Jean Wu