Related papers: MLKV: Multi-Layer Key-Value Heads for Memory Effic…

200 papers

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each…

Machine Learning · Computer Science 2026-04-09 James O'Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…

Machine Learning · Computer Science 2024-05-22 William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the…

Computation and Language · Computer Science 2024-06-05 Haoyi Wu, Kewei Tu

Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading…

Cryptography and Security · Computer Science 2026-02-11 Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, Wei Zhang

Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified…

Computation and Language · Computer Science 2025-02-06 You Wu, Haoyi Wu, Kewei Tu

Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV…

Computation and Language · Computer Science 2026-04-23 Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the…

Computation and Language · Computer Science 2026-04-28 Zahra Dehghanighobadi, Asja Fischer

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Xihao Chen, Yangyang Guo, Roger Zimmermann

Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-09 Bo Jiang, Taolue Yang, Youyuan Liu, Xubin He, Sheng Di, Sian Jin

While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the…

Machine Learning · Computer Science 2025-11-04 Keqi Deng, Philip C. Woodland

Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context…

Machine Learning · Computer Science 2024-11-13 Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as…

Machine Learning · Computer Science 2026-03-25 Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the…

Machine Learning · Computer Science 2025-04-29 Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen

Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries, which…

Machine Learning · Computer Science 2025-12-01 Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang

How to efficiently serve LLMs in practice has become exceptionally challenging due to their prohibitive memory and computation requirements. In this study, we investigate optimizing the KV cache, whose memory footprint poses a critical…

Computation and Language · Computer Science 2025-06-10 Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang

Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during…

Computation and Language · Computer Science 2025-07-16 Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between…

Machine Learning · Computer Science 2026-05-12 Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Kun-Woo Shin, Jay H. Park, Moonwook Oh, Yohan Jo, Jaeyoung Do, Sang-Won Lee