English
Related papers

Related papers: Optimal Workload Placement on Multi-Instance GPUs

200 papers

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit…

Machine Learning · Computer Science 2023-04-25 Ties Robroek , Ehsan Yousefzadeh-Asl-Miandoab , Pınar Tözün

Advances in GPU compute throughput and memory capacity brings significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Gabin Schieffer , Ruimin Shi , Jie Ren , Ivy Peng

The extensive use of GPUs in cloud computing and the growing need for multitenancy have driven the development of innovative solutions for efficient GPU resource management. Multi-Instance GPU (MIG) technology from NVIDIA enables shared GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-05 Ahmad Siavashi , Mahmoud Momtazpour

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-10 Baolin Li , Tirthak Patel , Siddarth Samsi , Vijay Gadepally , Devesh Tiwari

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-23 Tyler Griggs , Xiaoxuan Liu , Jiaxiang Yu , Doyoung Kim , Wei-Lin Chiang , Alvin Cheung , Ion Stoica

Modern GPU workloads increasingly demand efficient resource sharing, as many jobs do not require the full capacity of a GPU. Among sharing techniques, NVIDIA's Multi-Instance GPU (MIG) offers strong resource isolation by enabling…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-19 Hsu-Tzu Ting , Jerry Chou , Ming-Hung Chen , I-Hsin Chung

Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-15 Tirth Vamja , Kaustabha Ray , Felix George , UmaMaheswari C Devi

GPU-based heterogeneous architectures are now commonly used in HPC clusters. Due to their architectural simplicity specialized for data-level parallelism, GPUs can offer much higher computational throughput and memory bandwidth than CPUs in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Urvij Saroliya , Eishi Arima , Dai Liu , Martin Schulz

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Ran Yan , Youhe Jiang , Xiaonan Nie , Fangcheng Fu , Bin Cui , Binhang Yuan

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However,…

Artificial Intelligence · Computer Science 2024-04-18 Taeho Kim , Yanming Wang , Vatshank Chaturvedi , Lokesh Gupta , Seyeon Kim , Yongin Kwon , Sangtae Ha

Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-06 Youhe Jiang , Fangcheng Fu , Xiaozhe Yao , Guoliang He , Xupeng Miao , Ana Klimovic , Bin Cui , Binhang Yuan , Eiko Yoneki

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger , Foteini Strati , Natalie Enright Jerger , Ana Klimovic

Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-14 Shan Yu , Jiarong Xing , Yifan Qiao , Mingyuan Ma , Yangmin Li , Yang Wang , Shuo Yang , Zhiqiang Xie , Shiyi Cao , Ke Bao , Ion Stoica , Harry Xu , Ying Sheng

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-06 Seungbeom Choi , Sunho Lee , Yeonjae Kim , Jongse Park , Youngjin Kwon , Jaehyuk Huh

Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Tingyang Sun , Ting He , Bo Ji , Parimal Parag

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-30 Kshiteej Mahajan , Arjun Balasubramanian , Arjun Singhvi , Shivaram Venkataraman , Aditya Akella , Amar Phanishayee , Shuchi Chawla

Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints.…

Machine Learning · Computer Science 2026-04-10 Jiaming Cheng , Duong Tung Nguyen

The explosive growth of AI applications has created unprecedented demand for GPU resources. Cloud providers meet this demand through GPU-as-a-Service platforms that offer rentable GPU resources for running AI workloads. In this context, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-25 Marco Zambianco , Lorenzo Fasol , Roberto Doriguzzi-Corin

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-08 Eishi Arima , Minjoon Kang , Issa Saba , Josef Weidendorfer , Carsten Trinitis , Martin Schulz

The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-30 Marcin Lawenda , Kyrylo Khloponin , Krzesimir Samborski , Łukasz Szustak
‹ Prev 1 2 3 10 Next ›