Related papers: Multi-model Machine Learning Inference Serving wit…

ML Inference Scheduling with Predictable Latency

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…

Machine Learning · Computer Science 2025-12-25 Haidong Zhao, Nikolaos Georgantas

UELLM: A Unified and Efficient Approach for LLM Inference Serving

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-01 Yunseong Kim, Yujeong Choi, Minsoo Rhu

A Survey of Serverless Machine Learning Model Inference

Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Kamil Kojs

Towards Resource-Efficient Serverless LLM Inference with SLINFER

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Xueyan Tang, Minyi Guo

Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Tingyang Sun, Ting He, Bo Ji, Parimal Parag

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Dynamic Space-Time Scheduling for GPU Inference

Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-03 Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments

In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In-Geol Chun, Dimitrios Nikolopoulos, Cheol-Ho Hong

LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…

Artificial Intelligence · Computer Science 2025-07-30 Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

SneakPeek: Data-Aware Model Selection and Scheduling for Inference Serving on the Edge

Modern applications increasingly rely on inference serving systems to provide low-latency insights with a diverse set of machine learning models. Existing systems often utilize resource elasticity to scale with demand. However, many…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-13 Joel Wolfrath, Daniel Frink, Abhishek Chandra

Strait: Perceiving Priority and Interference in ML Inference Serving

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under…

Machine Learning · Computer Science 2026-05-01 Haidong Zhao, Nikolaos Georgantas

Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-10 Bowen Pang, Kai Li, Feifan Wang

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

DeServe: Towards Affordable Offline LLM Inference via Decentralization

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, Dawn Song

ECLIP: Energy-efficient and Practical Co-Location of ML Inference on Spatially Partitioned GPUs

As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power from idling components. To improve…

Systems and Control · Electrical Eng. & Systems 2025-06-17 Ryan Quach, Yidi Wang, Ali Jahanshahi, Daniel Wong, Hyoseung Kim

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models while avoiding proportionally increasing the inference cost. However, their large parameter sizes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Yichao Yuan, Lin Ma, Nishil Talati

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling

The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman

A Survey of Multi-Tenant Deep Learning Inference on GPU

Deep learning (DL) models have achieved superior performance. Meanwhile, computing hardware like NVIDIA GPUs has also demonstrated strong compute scaling trends, with 2x throughput and memory bandwidth for each generation. With such strong…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-26 Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, Xiang Chen

Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-13 Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead