Distributed, Parallel, and Cluster Computing · Computer Science
KV Cache Compression for Inference Efficiency in LLMs: A Review
Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou +3
2025-08-11
Artificial Intelligence · Computer Science
A Survey on Large Language Model Acceleration based on KV Cache Management
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang +6
2025-07-31
Machine Learning · Computer Science
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu
2024-12-10
Machine Learning · Computer Science
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An +7
2025-12-08
Computation and Language · Computer Science
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu +4
2025-04-01
Machine Learning · Computer Science
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
2024-07-01
Machine Learning · Computer Science
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng +3
2025-11-11
Machine Learning · Computer Science
Online Scheduling for LLM Inference with KV Cache Constraints
Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro +2
2026-01-16
Hardware Architecture · Computer Science
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Yunhua Fang, Rui Xie, Asad Ul Haq, Linsen Ma +5
2025-09-16
Hardware Architecture · Computer Science
Comparative Characterization of KV Cache Management Strategies for LLM Inference
Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu
2026-04-08
Computation and Language · Computer Science
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei +4
2024-12-13
Emerging Technologies · Computer Science
Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu +1
2025-05-29
Distributed, Parallel, and Cluster Computing · Computer Science
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
Bo Jiang, Taolue Yang, Youyuan Liu, Chengming Zhang +2
2025-09-03
Machine Learning · Computer Science
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu +7
2025-07-22
Computation and Language · Computer Science
KVCrush: Key value cache size-reduction using similarity in head-behaviour
Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Nilesh Jain
2026-01-06
Computation and Language · Computer Science
CORM: Cache Optimization with Recent Message for Large Language Model Inference
Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen +3
2024-06-24
Machine Learning · Computer Science
BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir +1
2025-02-25
Machine Learning · Computer Science
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li +1
2026-05-15
Computation and Language · Computer Science
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
Akide Liu, Jing Liu, Zizheng Pan, Yefei He +2
2024-09-10
Computation and Language · Computer Science
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani +1
2025-06-10
Machine Learning · Computer Science
Compute Or Load KV Cache? Why Not Both?
Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao
2025-02-24
Computation and Language · Computer Science
Efficient Long-Context LLM Inference via KV Cache Clustering
Jie Hu, Shengnan Wang, Yutong He, Ping Gong +7
2025-06-16