Related papers: VQ-LLM: High-performance Code Generation for Vecto…

200 papers

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector…

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank…

Computation and Language · Computer Science · 2026-03-18 · Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin

The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across…

Computation and Language · Computer Science · 2025-10-08 · Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science · 2024-03-05 · Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely…

Machine Learning · Computer Science · 2026-01-07 · Joseph Kampeas, Emir Haleva

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of a VLM is too long, which can cause intolerable inference latency and GPU memory usage. Existing methods propose…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan, Kai Lu, Jianzong Wang

KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues:…

Machine Learning · Computer Science · 2025-11-21 · Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-26 · Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native…

Machine Learning · Computer Science · 2025-06-10 · Pengxiang Zhao, Xiaoming Yuan

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become…

Machine Learning · Computer Science · 2024-05-08 · Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an…

Machine Learning · Computer Science · 2025-07-29 · Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei

LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a…

Machine Learning · Computer Science · 2025-05-30 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high fidelity audio…

Machine Learning · Computer Science · 2024-10-22 · Ankur Kumar

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science · 2024-12-10 · Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu

The cost of serving large language models (LLMs) is high, but expensive and scarce GPUs are used inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-19 · Jiaao He, Jidong Zhai

Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context…

Machine Learning · Computer Science · 2024-11-13 · Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

Generating texts with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even…

Hardware Architecture · Computer Science · 2023-06-12 · Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei

Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead…

Machine Learning · Computer Science · 2025-06-05 · Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram

Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly…

Machine Learning · Computer Science · 2025-05-23 · Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei

The rapid scaling of Large Language Models (LLMs) elevates inference costs and compounds substantial deployment barriers. While quantization to 8 or 4 bits mitigates this, sub-3-bit methods face severe accuracy, scalability, and efficiency…
