
Related papers: LongFlow: Efficient KV Cache Compression for Reaso…

200 papers

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows…

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Shrenik Patel , Daivik Patel

Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although…

Machine Learning · Computer Science 2025-12-23 Tao Zhang , Ziqian Zeng , Hao Peng , Huiping Zhuang , Cen Chen

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While…

Transformer-based large language models (LLMs) demonstrate impressive potential in various practical applications. However, long context inference poses a significant challenge due to the enormous memory requirements of the key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Bo Jiang , Taolue Yang , Youyuan Liu , Chengming Zhang , Xubin He , Sian Jin

With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu , Jingying Fu , Sixiang Liu , Yitian Zou , You Fu , Jiehan Zhou , Shouhua Zhang

Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important…

Computation and Language · Computer Science 2025-10-24 Yu Fu , Zefan Cai , Abedelkadir Asi , Wayne Xiong , Yue Dong , Wen Xiao

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the…

Machine Learning · Computer Science 2026-05-15 Kaiwen Chen , Xin Tan , Minchen Yu , Jingzong Li , Hong Xu

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought…

Computation and Language · Computer Science 2026-05-13 Xiang Liu , Zhenheng Tang , Hong Chen , Peijie Dong , Zeyu Li , Xiuze Zhou , Bo Li , Xuming Hu , Xiaowen Chu

Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as…

Computation and Language · Computer Science 2024-10-08 Isaac Rehg

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing…

Computation and Language · Computer Science 2026-05-28 Wenjie Du , Li Jiang , Keda Tao , Xue Liu , Huan Wang

Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective…

Artificial Intelligence · Computer Science 2026-03-03 Xinyue Ma , Heelim Hong , Taegeon Um , Jongseop Lee , Seoyeong Choy , Woo-Yeon Lee , Myeongjae Jeon

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi , Beomseok Kang , Dongwon Jo , Jae-Joon Kim

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache…

Machine Learning · Computer Science 2025-09-22 Dmitry Akulov , Mohamed Sana , Antonio De Domenico , Tareq Si Salem , Nicola Piovesan , Fadhel Ayed

Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence…

Computation and Language · Computer Science 2026-01-06 Gopi Krishna Jha , Sameh Gobriel , Liubov Talamanova , Nilesh Jain

Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable…

Computation and Language · Computer Science 2025-07-29 Dongquan Yang , Yifan Yang , Xiaotian Yu , Xianbiao Qi , Rong Xiao

Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Zaifeng Pan , Ajjkumar Patel , Zhengding Hu , Yipeng Shen , Yue Guan , Wan-Lu Li , Lianhui Qin , Yida Wang , Yufei Ding

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the…

Machine Learning · Computer Science 2025-04-01 Wei Gao , Xinyu Zhou , Peng Sun , Tianwei Zhang , Yonggang Wen

Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step…

Machine Learning · Computer Science 2025-09-30 Xianglong Yan , Zhiteng Li , Tianao Zhang , Haotong Qin , Linghe Kong , Yulun Zhang , Xiaokang Yang

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this…

Computation and Language · Computer Science 2026-05-19 Jian Lin , Jiazhi Mi , Zicong Hong , Haodong Wang , Qianli Liu , Haodyue Zhang , Peng Li , Song Guo