Related papers: VQ-LLM: High-performance Code Generation for Vecto…

CommVQ: Commutative Vector Quantization for KV Cache Compression

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector…

Computation and Language · Computer Science · 2025-06-24 · Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan

VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank…

Computation and Language · Computer Science · 2026-03-18 · Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin

VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across…

Computation and Language · Computer Science · 2025-10-08 · Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science · 2024-03-05 · Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Joint Encoding of KV-Cache Blocks for Scalable LLM Serving

Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely…

Machine Learning · Computer Science · 2026-01-07 · Joseph Kampeas, Emir Haleva

WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan, Kai Lu, Jianzong Wang

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues:…

Machine Learning · Computer Science · 2025-11-21 · Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-26 · Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native…

Machine Learning · Computer Science · 2025-06-10 · Pengxiang Zhao, Xiaoming Yuan

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become…

Machine Learning · Computer Science · 2024-05-08 · Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an…

Machine Learning · Computer Science · 2025-07-29 · Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a…

Machine Learning · Computer Science · 2025-05-30 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Residual vector quantization for KV cache compression in large language model

KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high fidelity audio…

Machine Learning · Computer Science · 2024-10-22 · Ankur Kumar

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science · 2024-12-10 · Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Cost of serving large language models (LLM) is high, but the expensive and scarce GPUs are poorly efficient when generating tokens sequentially, unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-19 · Jiaao He, Jidong Zhai

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context…

Machine Learning · Computer Science · 2024-11-13 · Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput

Generating texts with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even…

Hardware Architecture · Computer Science · 2023-06-12 · Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei

KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation

Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead…

Machine Learning · Computer Science · 2025-06-05 · Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram

NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly…

Machine Learning · Computer Science · 2025-05-23 · Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei

CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs

The rapid scaling of Large Language Models (LLMs) elevates inference costs and compounds substantial deployment barriers. While quantization to 8 or 4 bits mitigates this, sub-3-bit methods face severe accuracy, scalability, and efficiency…

Machine Learning · Computer Science · 2025-07-11 · Zhaojing Zhou, Xunchao Li, Minghao Li, Handi Zhang, Haoshuang Wang, Wenbin Chang, Yiqun Liu, Qingqing Dang, Dianhai Yu, Yanjun Ma, Haifeng Wang