Related papers: CacheFocus: Dynamic Cache Re-Positioning for Effic…

A Survey on Large Language Model Acceleration based on KV Cache Management

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science 2025-07-31 Haoyang Li , Yiming Li , Anxin Tian , Tianhao Tang , Zhanchao Xu , Xuejia Chen , Nicole Hu , Wei Dong , Qing Li , Lei Chen

Retrospective Sparse Attention for Efficient Long-Context Generation

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi , Beomseok Kang , Dongwon Jo , Jae-Joon Kim

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval…

Computation and Language · Computer Science 2026-04-21 Zhiyuan Shi , Qibo Qiu , Feng Xue , Zhonglin Jiang , Li Yu , Jian Jiang , Xiaofei He , Wenxiao Wang

Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models

With context windows of millions of tokens, Long-Context Language Models (LCLMs) can encode entire document collections, offering a strong alternative to conventional retrieval-augmented generation (RAG). However, it remains unclear whether…

Computation and Language · Computer Science 2026-01-27 Francesco Maria Molfese , Momchil Hardalov , Rexhina Blloshmi , Bill Byrne , Adrià de Gispert

Training-Free Exponential Context Extension via Cascading KV Cache

The transformer's context window is vital for tasks such as few-shot learning and conditional generation as it preserves previous tokens for active memory. However, as the context lengths increase, the computational costs grow…

Computation and Language · Computer Science 2025-04-01 Jeffrey Willette , Heejun Lee , Youngwan Lee , Myeongjae Jeon , Sung Ju Hwang

LoMA: Lossless Compressed Memory Attention

Large Language Models (LLMs) face limitations due to the high demand on GPU memory and computational resources when handling long contexts. While sparsifying the Key-Value (KV) cache of a transformer model is a typical strategy to alleviate…

Machine Learning · Computer Science 2024-02-06 Yumeng Wang , Zhenyang Xiao

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time…

Computation and Language · Computer Science 2024-06-27 Zhongwei Wan , Ziang Wu , Che Liu , Jinfa Huang , Zhihong Zhu , Peng Jin , Longyue Wang , Li Yuan

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin

Latent-Condensed Transformer for Efficient Long Context Modeling

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately:…

Computation and Language · Computer Science 2026-04-17 Zeng You , Yaofo Chen , Qiuwu Chen , Ying Sun , Shuhai Zhang , Yingjian Li , Yaowei Wang , Mingkui Tan

KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov , Mohamed Sana , Antonio De Domenico , Tareq Si Salem , Nicola Piovesan , Fadhel Ayed

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the…

Computation and Language · Computer Science 2024-08-09 Yilong Chen , Guoxia Wang , Junyuan Shang , Shiyao Cui , Zhenyu Zhang , Tingwen Liu , Shuohuan Wang , Yu Sun , Dianhai Yu , Hua Wu

Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity

The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV…

Machine Learning · Computer Science 2025-11-10 Pratik Poudel

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu , Jing Liu , Zizheng Pan , Yefei He , Gholamreza Haffari , Bohan Zhuang

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly…

Computation and Language · Computer Science 2024-07-02 Bin Gao , Zhuomin He , Puru Sharma , Qingxuan Kang , Djordje Jevdjic , Junbo Deng , Xingkun Yang , Zhou Yu , Pengfei Zuo

KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during…

Computation and Language · Computer Science 2025-07-16 Luohe Shi , Zuchao Li , Lefei Zhang , Guoming Liu , Baoyuan Qi , Hai Zhao

WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models

Large Language Models (LLMs) use key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache size increases linearly during generation, leading to excessive memory usage, especially for long…

Computation and Language · Computer Science 2025-03-04 Jian Yuan , Ziwei He , Haoli Bai , Jingwen Leng , Bo Jiang

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao , Xinyu Zhou , Peng Sun , Tianwei Zhang , Yonggang Wen

AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by…

Operating Systems · Computer Science 2026-01-19 Shaoting Feng , Hanchen Li , Kuntai Du , Zhuohan Gu , Yuhan Liu , Jiayi Yao , Siddhant Ray , Samuel Shen , Yihua Cheng , Ganesh Ananthanarayanan , Junchen Jiang

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management…

Computer Vision and Pattern Recognition · Computer Science 2024-07-26 Zuyan Liu , Benlin Liu , Jiahui Wang , Yuhao Dong , Guangyi Chen , Yongming Rao , Ranjay Krishna , Jiwen Lu

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths…

Computation and Language · Computer Science 2025-10-10 Wei Wu , Zhuoshi Pan , Chao Wang , Liyi Chen , Yunchu Bai , Tianfu Wang , Kun Fu , Zheng Wang , Hui Xiong