Related papers: Queue management for SLO-oriented large language m…

200 papers

Queueing systems present many opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. This integration raises numerous open questions about how predictions can be effectively…

Artificial Intelligence · Computer Science · 2025-03-11 · Michael Mitzenmacher, Rana Shahout

Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experiences, service level objectives (SLOs) serve as crucial benchmarks for assessing inference…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-06-13 · Jinqi Huang, Yi Xiong, Xuebing Yu, Wenjie Huang, Entong Li, Li Zeng, Xin Chen

The Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high…

Artificial Intelligence · Computer Science · 2024-06-21 · Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia

Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical…

Machine Learning · Computer Science · 2025-01-29 · Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden

Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-23 · Yihao Zhao, Jiadun Chen, Peng Sun, Lei Li, Xuanzhe Liu, Xin Jin

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference…

Machine Learning · Computer Science · 2025-03-13 · Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa

The integration of Large Language Models (LLMs) into applications ranging from interactive chatbots to multi-agent systems has introduced a wide spectrum of service-level objectives (SLOs) for responsiveness. These include latency-sensitive…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-23 · Wei Zhang, Zhiyu Wu, Yi Mu, Rui Ning, Banruo Liu, Nikhil Sarda, Myungjin Lee, Fan Lai

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-25 · Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu

This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-15 · Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, Phillip B. Gibbons

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks…

Machine Learning · Computer Science · 2023-09-13 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall…

Machine Learning · Computer Science · 2025-10-30 · Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Zhonghui Zhang, Nguyen Cam-Tu, Rong Gu, Chen Tian, Guihai Chen, Sheng Zhong

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-12 · Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. Since the request generation length is generally unpredictable, it is difficult to estimate…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-11 · Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to…

Machine Learning · Computer Science · 2024-08-29 · Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-08 · Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

Inference serving for large language models (LLMs) is the key to unleashing their potential in people's daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and…

Hardware Architecture · Computer Science · 2024-06-07 · Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin

The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-25 · Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, Aditya Akella

The widespread adoption of Large Language Models (LLMs) has enabled diverse applications with very different latency requirements. Existing LLM serving frameworks rely on siloed infrastructure with coarse-grained workload segregation --…

Machine Learning · Computer Science · 2025-03-31 · Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, Ramachandran Ramjee

Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-06 · Konstantinos Papaioannou, Thaleia Dimitra Doudali

AI-enabled systems are subject to various types of runtime uncertainty, such as dynamic workloads, resource requirements, and model drift. These uncertainties have a big impact on the overall Quality of Service (QoS). This is…

Software Engineering · Computer Science · 2026-02-04 · Hemang Jain, Divyansh Pandey, Karthik Vaidhyanathan