Related papers: CacheFlow: Efficient LLM Serving with 3D-Parallel …

200 papers

In long-context Large Language Model (LLM) inference, the Time-To-First-Token (TTFT) latency incurred by the prefill stage has become the foremost bottleneck limiting interactive performance and deployment cost. KV Cache reuse offers a…

Hardware Architecture · Computer Science 2026-05-26 Fei li, Song Liu, Yan Liu, Jinhua Cui, Shiqiang Nie, Jinyu Wang, Weiguo Wu

Recent advances in long-text understanding have pushed the context length of large language models (LLMs) up to one million tokens. This boosts LLMs' accuracy and reasoning capacity but causes exorbitant computational costs and…

Computation and Language · Computer Science 2025-05-19 Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, Deyu Zhang

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint…

Machine Learning · Computer Science 2026-03-24 Yichun Xu, Navjot K. Khaira, Tejinder Singh

The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-augmented generation (RAG), makes contextual states (i.e., KV cache) reusable across user requests. Given the capacity constraints of GPU memory, only a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-08 Shiwei Gao, Youmin Chen, Jiwu Shu

Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-25 Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso

Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding

Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective…

Artificial Intelligence · Computer Science 2026-03-03 Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill…

Machine Learning · Computer Science 2026-03-17 Yingsheng Geng, Yuchong Gao, Weihong Wu, Guyue Liu, Jiang Liu

Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation…

Machine Learning · Computer Science 2025-10-06 Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

Multi-modal Large Language Model (MLLM) serving systems commonly employ KV-cache compression to reduce memory footprint. However, existing compression methods introduce significant processing overhead and queuing delays, particularly in…

Multimedia · Computer Science 2025-03-12 Jianian Zhu, Hang Wu, Haojie Wang, Yinghui Li, Biao Hou, Ruixuan Li, Jidong Zhai

In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets, offering an…

Computation and Language · Computer Science 2026-01-14 Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang

Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when…

Machine Learning · Computer Science 2025-04-07 Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang

Efficient inference with Large Language Models (LLMs) increasingly relies on Key-Value (KV) caches to store previously computed key and value vectors at each layer. These caches are essential to minimize redundant computation during…

Hardware Architecture · Computer Science 2026-04-08 Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes…

Machine Learning · Computer Science 2026-05-22 Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache…

Hardware Architecture · Computer Science 2026-05-01 Sanjeev Rao Ganjihal

Efficiently serving Large Language Models (LLMs) with persistent Prefix Key-Value (KV) Cache is critical for applications like conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV…

Operating Systems · Computer Science 2026-01-21 Jing Zou, Shangyu Wu, Hancong Duan, Qiao Li, Chun Jason Xue

Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV…

Computation and Language · Computer Science 2026-04-23 Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Shrenik Patel, Daivik Patel

The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-10 Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan

KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM…

Operating Systems · Computer Science 2026-05-27 Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica