Related papers: CacheFocus: Dynamic Cache Re-Positioning for Effic…

200 papers

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science 2025-07-31 Haoyang Li , Yiming Li , Anxin Tian , Tianhao Tang , Zhanchao Xu , Xuejia Chen , Nicole Hu , Wei Dong , Qing Li , Lei Chen

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi , Beomseok Kang , Dongwon Jo , Jae-Joon Kim

The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information. Although recent dynamic retrieval…

Computation and Language · Computer Science 2026-04-21 Zhiyuan Shi , Qibo Qiu , Feng Xue , Zhonglin Jiang , Li Yu , Jian Jiang , Xiaofei He , Wenxiao Wang

With context windows of millions of tokens, Long-Context Language Models (LCLMs) can encode entire document collections, offering a strong alternative to conventional retrieval-augmented generation (RAG). However, it remains unclear whether…

Computation and Language · Computer Science 2026-01-27 Francesco Maria Molfese , Momchil Hardalov , Rexhina Blloshmi , Bill Byrne , Adrià de Gispert

The transformer's context window is vital for tasks such as few-shot learning and conditional generation as it preserves previous tokens for active memory. However, as the context lengths increase, the computational costs grow…

Computation and Language · Computer Science 2025-04-01 Jeffrey Willette , Heejun Lee , Youngwan Lee , Myeongjae Jeon , Sung Ju Hwang

Large Language Models (LLMs) face limitations due to the high demand on GPU memory and computational resources when handling long contexts. While sparsifying the Key-Value (KV) cache of transformer models is a typical strategy to alleviate…

Machine Learning · Computer Science 2024-02-06 Yumeng Wang , Zhenyang Xiao

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time…

Computation and Language · Computer Science 2024-06-27 Zhongwei Wan , Ziang Wu , Che Liu , Jinfa Huang , Zhihong Zhu , Peng Jin , Longyue Wang , Li Yuan

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately:…

Computation and Language · Computer Science 2026-04-17 Zeng You , Yaofo Chen , Qiuwu Chen , Ying Sun , Shuhai Zhang , Yingjian Li , Yaowei Wang , Mingkui Tan

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov , Mohamed Sana , Antonio De Domenico , Tareq Si Salem , Nicola Piovesan , Fadhel Ayed

Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the…

Computation and Language · Computer Science 2024-08-09 Yilong Chen , Guoxia Wang , Junyuan Shang , Shiyao Cui , Zhenyu Zhang , Tingwen Liu , Shuohuan Wang , Yu Sun , Dianhai Yu , Hua Wu

The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV…

Machine Learning · Computer Science 2025-11-10 Pratik Poudel

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu , Jing Liu , Zizheng Pan , Yefei He , Gholamreza Haffari , Bohan Zhuang

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly…

Computation and Language · Computer Science 2024-07-02 Bin Gao , Zhuomin He , Puru Sharma , Qingxuan Kang , Djordje Jevdjic , Junbo Deng , Xingkun Yang , Zhou Yu , Pengfei Zuo

Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during…

Computation and Language · Computer Science 2025-07-16 Luohe Shi , Zuchao Li , Lefei Zhang , Guoming Liu , Baoyuan Qi , Hai Zhao

Large Language Models (LLMs) use key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache size increases linearly during generation, leading to excessive memory usage, especially for long…

Computation and Language · Computer Science 2025-03-04 Jian Yuan , Ziwei He , Haoli Bai , Jingwen Leng , Bo Jiang

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao , Xinyu Zhou , Peng Sun , Tianwei Zhang , Yonggang Wen

Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by…

In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management…

Computer Vision and Pattern Recognition · Computer Science 2024-07-26 Zuyan Liu , Benlin Liu , Jiahui Wang , Yuhao Dong , Guangyi Chen , Yongming Rao , Ranjay Krishna , Jiwen Lu

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths…

Computation and Language · Computer Science 2025-10-10 Wei Wu , Zhuoshi Pan , Chao Wang , Liyi Chen , Yunchu Bai , Tianfu Wang , Kun Fu , Zheng Wang , Hui Xiong