Related papers: Optimal Workload Placement on Multi-Instance GPUs

An Analysis of Collocation on GPUs for Deep Learning Training

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit…

Machine Learning · Computer Science 2023-04-25 Ties Robroek , Ehsan Yousefzadeh-Asl-Miandoab , Pınar Tözün

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Gabin Schieffer , Ruimin Shi , Jie Ren , Ivy Peng

A Multi-Objective Framework for Optimizing GPU-Enabled VM Placement in Cloud Data Centers with Multi-Instance GPU Technology

The extensive use of GPUs in cloud computing and the growing need for multitenancy have driven the development of innovative solutions for efficient GPU resource management. Multi-Instance GPU (MIG) technology from NVIDIA enables shared GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-05 Ahmad Siavashi , Mahmoud Momtazpour

MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-10 Baolin Li , Tirthak Patel , Siddarth Samsi , Vijay Gadepally , Devesh Tiwari

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-23 Tyler Griggs , Xiaoxuan Liu , Jiaxiang Yu , Doyoung Kim , Wei-Lin Chiang , Alvin Cheung , Ion Stoica

An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs

Modern GPU workloads increasingly demand efficient resource sharing, as many jobs do not require the full capacity of a GPU. Among sharing techniques, NVIDIA's Multi-Instance GPU (MIG) offers strong resource isolation by enabling…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-19 Hsu-Tzu Ting , Jerry Chou , Ming-Hung Chen , I-Hsin Chung

On the Partitioning of GPU Power among Multi-Instances

Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-15 Tirth Vamja , Kaustabha Ray , Felix George , UmaMaheswari C Devi

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Urvij Saroliya , Eishi Arima , Dai Liu , Martin Schulz

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Ran Yan , Youhe Jiang , Xiaonan Nie , Fangcheng Fu , Bin Cui , Binhang Yuan

LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However,…

Artificial Intelligence · Computer Science 2024-04-18 Taeho Kim , Yanming Wang , Vatshank Chaturvedi , Lokesh Gupta , Seyeon Kim , Yongin Kwon , Sangtae Ha

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs

Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-06 Youhe Jiang , Fangcheng Fu , Xiaozhe Yao , Guoliang He , Xupeng Miao , Ana Klimovic , Bin Cui , Binhang Yuan , Eiko Yoneki

Understanding GPU Resource Interference One Level Deeper

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger , Foteini Strati , Natalie Enright Jerger , Ana Klimovic

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-14 Shan Yu , Jiarong Xing , Yifan Qiao , Mingyuan Ma , Yangmin Li , Yang Wang , Shuo Yang , Zhiqiang Xie , Shiyi Cao , Ke Bao , Ion Stoica , Harry Xu , Ying Sheng

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-06 Seungbeom Choi , Sunho Lee , Yeonjae Kim , Jongse Park , Youngjin Kwon , Jaehyuk Huh

Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Tingyang Sun , Ting He , Bo Ji , Parimal Parag

Themis: Fair and Efficient GPU Cluster Scheduling

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-30 Kshiteej Mahajan , Arjun Balasubramanian , Arjun Singhvi , Shivaram Venkataraman , Aditya Akella , Amar Phanishayee , Shuchi Chawla

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints.…

Machine Learning · Computer Science 2026-04-10 Jiaming Cheng , Duong Tung Nguyen

An Online Fragmentation-Aware GPU Scheduler for Multi-Tenant MIG-based Clouds

The explosive growth of AI applications has created unprecedented demand for GPU resources. Cloud providers meet this demand through GPU-as-a-Service platforms that offer rentable GPU resources for running AI workloads. In this context, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-25 Marco Zambianco , Lorenzo Fasol , Roberto Doriguzzi-Corin

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-08 Eishi Arima , Minjoon Kang , Issa Saba , Josef Weidendorfer , Carsten Trinitis , Martin Schulz

Profiling and optimization of multi-card GPU machine learning jobs

The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-30 Marcin Lawenda , Kyrylo Khloponin , Krzesimir Samborski , Łukasz Szustak