Related papers: Queue management for SLO-oriented large language m…

Queueing, Predictions, and LLMs: Challenges and Open Problems

Queueing systems present many opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. This integration raises numerous open questions about how predictions can be effectively…

Artificial Intelligence · Computer Science 2025-03-11 Michael Mitzenmacher, Rana Shahout

SLO-Aware Scheduling for Large Language Model Inferences

Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experiences, service level objectives (SLOs) serve as crucial benchmarks for assessing inference…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-13 Jinqi Huang, Yi Xiong, Xuebing Yu, Wenjie Huang, Entong Li, Li Zeng, Xin Chen

VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework

The Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high…

Artificial Intelligence · Computer Science 2024-06-21 Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs

Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical…

Machine Learning · Computer Science 2025-01-29 Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden

SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference

Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Yihao Zhao, Jiadun Chen, Peng Sun, Lei Li, Xuanzhe Liu, Xin Jin

Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference…

Machine Learning · Computer Science 2025-03-13 Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa

JITServe: SLO-aware LLM Serving with Imprecise Request Information

The integration of Large Language Models (LLMs) into applications ranging from interactive chatbots to multi-agent systems has introduced a wide spectrum of service-level objectives (SLOs) for responsiveness. These include latency-sensitive…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-23 Wei Zhang, Zhiyu Wu, Yi Mu, Rui Ning, Banruo Liu, Nikhil Sarda, Myungjin Lee, Fan Lai

UELLM: A Unified and Efficient Approach for LLM Inference Serving

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu

SLOs-Serve: Optimized Serving of Multi-SLO LLMs

This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, Phillip B. Gibbons

Efficient Memory Management for Large Language Model Serving with PagedAttention

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks…

Machine Learning · Computer Science 2023-09-13 Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving

User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall…

Machine Learning · Computer Science 2025-10-30 Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Zhonghui Zhang, Nguyen Cam-Tu, Rong Gu, Chen Tian, Guihai Chen, Sheng Zhong

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-12 Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. Since the request generation length is generally unpredictable, it is difficult to estimate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-11 Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

Efficient LLM Scheduling by Learning to Rank

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to…

Machine Learning · Computer Science 2024-08-29 Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

Llumnix: Dynamic Scheduling for Large Language Model Serving

Inference serving for large language models (LLMs) is the key to unleashing their potential in people's daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and…

Hardware Architecture · Computer Science 2024-06-07 Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, Aditya Akella

Niyama: Breaking the Silos of LLM Inference Serving

The widespread adoption of Large Language Models (LLMs) has enabled diverse applications with very different latency requirements. Existing LLM serving frameworks rely on siloed infrastructure with coarse-grained workload segregation --…

Machine Learning · Computer Science 2025-03-31 Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, Ramachandran Ramjee

TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference

Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-06 Konstantinos Papaioannou, Thaleia Dimitra Doudali

CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems

AI-enabled systems are subjected to various types of runtime uncertainties, ranging from dynamic workloads, resource requirements, model drift, etc. These uncertainties have a big impact on the overall Quality of Service (QoS). This is…

Software Engineering · Computer Science 2026-02-04 Hemang Jain, Divyansh Pandey, Karthik Vaidhyanathan