Related papers: Efficient LLM Inference with Kcache

KV Cache Compression for Inference Efficiency in LLMs: A Review

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu , Jingying Fu , Sixiang Liu , Yitian Zou , You Fu , Jiehan Zhou , Shouhua Zhang

A Survey on Large Language Model Acceleration based on KV Cache Management

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science 2025-07-31 Haoyang Li , Yiming Li , Anxin Tian , Tianhao Tang , Zhanchao Xu , Xuejia Chen , Nicole Hu , Wei Dong , Qing Li , Lei Chen

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li , Zhigang Wang , Yu Gu , Ge Yu

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the…

Computation and Language · Computer Science 2024-06-05 Haoyi Wu , Kewei Tu

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across…

Machine Learning · Computer Science 2025-12-08 Yuhan Liu , Yihua Cheng , Jiayi Yao , Yuwei An , Xiaokun Chen , Shaoting Feng , Yuyang Huang , Samuel Shen , Rui Zhang , Kuntai Du , Junchen Jiang

A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified…

Computation and Language · Computer Science 2025-02-06 You Wu , Haoyi Wu , Kewei Tu

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary…

Computation and Language · Computer Science 2025-04-01 Hailin Zhang , Xiaodong Ji , Yilin Chen , Fangcheng Fu , Xupeng Miao , Xiaonan Nie , Weipeng Chen , Bin Cui

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory…

Machine Learning · Computer Science 2024-07-01 Wonbeom Lee , Jungi Lee , Junghwan Seo , Jaewoong Sim

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method…

Machine Learning · Computer Science 2025-11-11 Yanhao Dong , Yubo Miao , Weinan Li , Xiao Zheng , Chao Wang , Jiesheng Wu , Feng Lyu

Online Scheduling for LLM Inference with KV Cache Constraints

Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A…

Machine Learning · Computer Science 2026-01-16 Patrick Jaillet , Jiashuo Jiang , Konstantina Mellou , Marco Molinaro , Chara Podimata , Zijie Zhou

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint…

Machine Learning · Computer Science 2026-03-24 Yichun Xu , Navjot K. Khaira , Tejinder Singh

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System

Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past…

Hardware Architecture · Computer Science 2025-09-16 Yunhua Fang , Rui Xie , Asad Ul Haq , Linsen Ma , Kaoutar El Maghraoui , Naigang Wang , Meng Wang , Liu Liu , Tong Zhang

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Efficient inference with Large Language Models (LLMs) increasingly relies on Key-Value (KV) caches to store previously computed key and value vectors at each layer. These caches are essential to minimize redundant computation during…

Hardware Architecture · Computer Science 2026-04-08 Oteo Mamo , Olga Kogiou , Hyunjin Yi , Weikuan Yu

ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

Large Language Models (LLMs) have become a research hotspot. To accelerate the inference of LLMs, storing computed caches in memory has become the standard technique. However, as the inference length increases, growing KV caches might lead…

Computation and Language · Computer Science 2024-12-13 Meizhi Zhong , Xikai Liu , Chen Zhang , Yikun Lei , Yan Gao , Yao Hu , Kehai Chen , Min Zhang

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG)…

Emerging Technologies · Computer Science 2025-05-29 Yue Zhu , Hao Yu , Chen Wang , Zhuoran Liu , Eun Kyung Lee

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang , Taolue Yang , Youyuan Liu , Chengming Zhang , Xubin He , Sian Jin

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin

KVCrush: Key value cache size-reduction using similarity in head-behaviour

Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence…

Computation and Language · Computer Science 2026-01-06 Gopi Krishna Jha , Sameh Gobriel , Liubov Talamanova , Nilesh Jain

CORM: Cache Optimization with Recent Message for Large Language Model Inference

Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory…

Computation and Language · Computer Science 2024-06-24 Jincheng Dai , Zhuowei Huang , Haiyun Jiang , Chen Chen , Deng Cai , Wei Bi , Shuming Shi

BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference

In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache…

Machine Learning · Computer Science 2025-02-25 Ahmed Burak Gulhan , Krishna Teja Chitty-Venkata , Murali Emani , Mahmut Kandemir , Venkatram Vishwanath