Related papers: SCOPE: Optimizing Key-Value Cache Compression in L…

200 papers

Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache…

Artificial Intelligence · Computer Science 2026-05-29 Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, You Fu, Jiehan Zhou, Shouhua Zhang

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang, Xubin He, Sian Jin

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science 2025-03-21 Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens,…

Computation and Language · Computer Science 2025-04-23 Neusha Javidnia, Bita Darvish Rouhani, Farinaz Koushanfar

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows…

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint…

Machine Learning · Computer Science 2026-03-24 Yichun Xu, Navjot K. Khaira, Tejinder Singh

The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major…

Machine Learning · Computer Science 2025-12-08 Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising…

Machine Learning · Computer Science 2026-05-18 Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

The Key-Value (KV) cache is crucial for efficient Large Language Models (LLMs) inference, but excessively long contexts drastically increase KV cache memory footprint. Existing KV cache compression methods typically rely on input-side…

Computation and Language · Computer Science 2026-03-13 Zhenxu Tian, Yi Su, Juntao Li, Min Zhang

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences,…

Machine Learning · Computer Science 2026-04-28 Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for…

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as…

Computation and Language · Computer Science 2024-10-08 Isaac Rehg

Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive…

Computation and Language · Computer Science 2025-09-30 Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri

As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context…

Computation and Language · Computer Science 2026-04-09 Zhirui Chen, Peiyang Liu, Ling Shao

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token…

Computation and Language · Computer Science 2025-07-04 Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh