Related papers: GPU Sharing with Triples Mode

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-14 Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, Ying Sheng

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-11 Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng

GPUnion: Autonomous GPU Sharing on Campus

A pronounced imbalance in GPU resources exists on campus, where some laboratories own underutilized servers while others lack the compute needed for AI research. GPU sharing can alleviate this disparity, while existing platforms typically…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yufang Li, Yuanbo Zhang, Hanlong Liao, Deke Guo, Guoming Tang

Design and Operation of Shared Machine Learning Clusters on Campus

Amid the rapid advancements in large machine learning (ML) models, universities worldwide are investing substantial funds and efforts into GPU clusters. However, managing a shared GPU cluster poses a pyramid of challenges, from hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-15 Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, Kai Chen

Improving GPU Performance Through Resource Sharing

Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence…

Hardware Architecture · Computer Science 2015-06-08 Vishwesh Jatala, Jayvant Anantpur, Amey Karkare

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-14 Peifeng Yu, Mosharaf Chowdhury

Intra-node Memory Safe GPU Co-Scheduling

GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-12-14 Carlos Reano, Federico Silla, Dimitrios S. Nikolopoulos, Blesson Varghese

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

The High Performance Computing (HPC) field is witnessing a widespread adoption of Graphics Processing Units (GPUs) as co-processors for conventional homogeneous clusters. The adoption of prevalent Single-Program Multiple-Data (SPMD)…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-25 Teng Li, Vikram K. Narayana, Tarek El-Ghazawi

On the Partitioning of GPU Power among Multi-Instances

Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-15 Tirth Vamja, Kaustabha Ray, Felix George, UmaMaheswari C Devi

MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-10 Baolin Li, Tirthak Patel, Siddarth Samsi, Vijay Gadepally, Devesh Tiwari

Towards Efficient and Practical GPU Multitasking in the Era of LLM

GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet…

Operating Systems · Computer Science 2025-08-13 Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, Ion Stoica

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-18 Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

Data-parallel distributed training of very large models beyond GPU capacity

GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-30 Samuel Matzek, Max Grossman, Minsik Cho, Anar Yusifov, Bryant Nelson, Amit Juneja

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-06 Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh

SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference

Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Yihao Zhao, Jiadun Chen, Peng Sun, Lei Li, Xuanzhe Liu, Xin Jin

The Landscape of GPU-Centric Communication

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Coordinated Cooling and Compute Management for AI Datacenters

The AI datacenters are currently being deployed on a large scale to support the training and deployment of power-intensive large-language models (LLMs). Extensive amount of computation and cooling required in datacenters increase concerns…

Systems and Control · Electrical Eng. & Systems 2026-01-14 Nardos Belay Abera, Yize Chen