Related papers: Efficient Interactive LLM Serving with Proxy Model…

200 papers

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to…

Machine Learning · Computer Science 2024-08-29 Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications necessitate minimal response latency and maximal…

Performance · Computer Science 2024-11-01 Youpeng Zhao, Jun Wang

Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from…

Machine Learning · Computer Science 2025-10-13 Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan

Serving Large Language Models (LLMs) under mixed workloads--short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests--poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-30 Bronislav Sidik, Chaya Levi, Joseph Kampeas

We propose ELIS, a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler designed to efficiently manage inference tasks with the shortest remaining tokens. Current LLM serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-15 Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, Minsung Jang

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for…

Machine Learning · Computer Science 2026-05-26 Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls. In interactive LLM applications, efficient scheduling is crucial for maintaining low request…

Machine Learning · Computer Science 2024-10-29 Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use…

Machine Learning · Computer Science 2024-09-26 Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin

Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. While most existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-04 Yifan Sun, Gholamreza Haffari, Minxian Xu, Rajkumar Buyya, Adel N. Toosi

The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-15 Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, Xianglong Liu

Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated…

Computation and Language · Computer Science 2025-05-26 Ruixiao Li, Fahao Chen, Peng Li

We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online and multi-task service process and also heavily energy consuming by which a pre-trained LLM processes…

Machine Learning · Computer Science 2025-09-03 Zixi Chen, Yinyu Ye, Zijie Zhou

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference…

Machine Learning · Computer Science 2025-03-13 Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. Since the request generation length is generally unpredictable, it is difficult to estimate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-11 Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an…

Computation and Language · Computer Science 2023-05-30 Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You

Large language model (LLM) inference at the network edge is a promising serving paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based LLM inference systems typically adopt…

Systems and Control · Electrical Eng. & Systems 2025-10-14 Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan

We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each…

Optimization and Control · Mathematics 2026-02-11 Meixuan Wang, Yinyu Ye, Zijie Zhou

Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing…

Robotics · Computer Science 2024-12-30 Neiwen Ling, Guojun Chen, Lin Zhong

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet, widely used serving engines rely on first-come-first-served (FCFS) scheduling, which…

Machine Learning · Computer Science 2026-03-13 Darshan Makwana, Yash Jogi, Harsh Kotta, Aayush Kubba

Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper…

Computation and Language · Computer Science 2024-12-17 David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, En-Shiun Annie Lee