Related papers: FastCache: Optimizing Multimodal LLM Serving throu…

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and…

Machine Learning · Computer Science 2026-04-21 Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is…

Networking and Internet Architecture · Computer Science 2024-07-23 Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

KV Cache Compression for Inference Efficiency in LLMs: A Review

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, You Fu, Jiehan Zhou, Shouhua Zhang

Joint Encoding of KV-Cache Blocks for Scalable LLM Serving

Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely…

Machine Learning · Computer Science 2026-01-07 Joseph Kampeas, Emir Haleva

PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression

Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-09 Bo Jiang, Taolue Yang, Youyuan Liu, Xubin He, Sheng Di, Sian Jin

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang

AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by…

Operating Systems · Computer Science 2026-01-19 Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Cost of serving large language models (LLM) is high, but the expensive and scarce GPUs are poorly efficient when generating tokens sequentially, unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-19 Jiaao He, Jidong Zhai

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that rely on external function calls. This workload creates severe performance challenges for the KV Cache: spatial contention leads to the eviction…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-21 Zhuohang Bian, Feiyang Wu, Zhuoran Li, Teng Ma, Youwei Zhuo

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as…

Computation and Language · Computer Science 2024-10-08 Isaac Rehg

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu

A Survey on Large Language Model Acceleration based on KV Cache Management

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science 2025-07-31 Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and…

Computation and Language · Computer Science 2024-10-31 Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

Large Language Models (LLMs), epitomized by ChatGPT's release in late 2022, have revolutionized various industries with their advanced language comprehension. However, their efficiency is challenged by the Transformer architecture's…

Computation and Language · Computer Science 2024-11-21 Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu, Wenjing Huang, Yida Gu, Xingchen Liu, Zheng Wei, Jinyang Liu, Dingwen Tao, Guangming Tan

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs

The linear growth of key-value (KV) cache memory and the quadratic computational complexity of attention mechanisms pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization…

Computation and Language · Computer Science 2025-10-07 Xin Liu, Xudong Wang, Pei Liu, Guoming Tang

KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse

Recent advances in long-text understanding have pushed the context length of large language models (LLMs) up to one million tokens. It boosts LLMs' accuracy and reasoning capacity but causes exorbitant computational costs and…

Computation and Language · Computer Science 2025-05-19 Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, Deyu Zhang