Related papers: SCOPE: Optimizing Key-Value Cache Compression in L…

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache…

Artificial Intelligence · Computer Science 2026-05-29 Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

KV Cache Compression for Inference Efficiency in LLMs: A Review

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, You Fu, Jiehan Zhou, Shouhua Zhang

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science 2025-03-21 Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques

Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens,…

Computation and Language · Computer Science 2025-04-23 Neusha Javidnia, Bita Darvish Rouhani, Farinaz Koushanfar

Retrospective Sparse Attention for Efficient Long-Context Generation

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows…

Computation and Language · Computer Science 2025-12-16 Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li, Ce Zhang, Furong Huang, Kunpeng Zhang

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint…

Machine Learning · Computer Science 2026-03-24 Yichun Xu, Navjot K. Khaira, Tejinder Singh

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major…

Machine Learning · Computer Science 2025-12-08 Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

KV Cache Offloading for Context-Intensive Tasks

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising…

Machine Learning · Computer Science 2026-05-18 Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

The Key-Value (KV) cache is crucial for efficient Large Language Models (LLMs) inference, but excessively long contexts drastically increase KV cache memory footprint. Existing KV cache compression methods typically rely on input-side…

Computation and Language · Computer Science 2026-03-13 Zhenxu Tian, Yi Su, Juntao Li, Min Zhang

LongFlow: Efficient KV Cache Compression for Reasoning Models

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences,…

Machine Learning · Computer Science 2026-04-28 Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for…

Machine Learning · Computer Science 2026-04-15 David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as…

Computation and Language · Computer Science 2024-10-08 Isaac Rehg

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching

Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive…

Computation and Language · Computer Science 2025-09-30 Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri

StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context…

Computation and Language · Computer Science 2026-04-09 Zhirui Chen, Peiyang Liu, Ling Shao

Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token…

Computation and Language · Computer Science 2025-07-04 Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh