Related papers: LongFlow: Efficient KV Cache Compression for Reaso…

Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows…

Computation and Language · Computer Science 2025-12-16 Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li, Ce Zhang, Furong Huang, Kunpeng Zhang

CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Shrenik Patel, Daivik Patel

MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although…

Machine Learning · Computer Science 2025-12-23 Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While…

Computation and Language · Computer Science 2026-01-23 Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin

KV Cache Compression for Inference Efficiency in LLMs: A Review

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, You Fu, Jiehan Zhou, Shouhua Zhang

Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important…

Computation and Language · Computer Science 2025-10-24 Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the…

Machine Learning · Computer Science 2026-05-15 Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li, Hong Xu

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought…

Computation and Language · Computer Science 2026-05-13 Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as…

Computation and Language · Computer Science 2024-10-08 Isaac Rehg

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing…

Computation and Language · Computer Science 2026-05-28 Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective…

Artificial Intelligence · Computer Science 2026-03-03 Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo-Yeon Lee, Myeongjae Jeon

Retrospective Sparse Attention for Efficient Long-Context Generation

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

KVCrush: Key value cache size-reduction using similarity in head-behaviour

Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence…

Computation and Language · Computer Science 2026-01-06 Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Nilesh Jain

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable…

Computation and Language · Computer Science 2025-07-29 Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao

KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step…

Machine Learning · Computer Science 2025-09-30 Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this…

Computation and Language · Computer Science 2026-05-19 Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo