Related papers: Leech Lattice Vector Quantization for Efficient LL…

Spherical Leech Quantization for Visual Tokenization and Generation

Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Yue Zhao , Hanwen Jiang , Zhenlin Xu , Chutong Yang , Ehsan Adeli , Philipp Krähenbühl

Pyramid Vector Quantization for LLMs

Recent works on compression of large language models (LLM) using quantization considered reparameterizing the architecture such that weights are distributed on the sphere. This demonstrably improves the ability to quantize by increasing…

Machine Learning · Computer Science 2024-12-05 Tycho F. A. van der Ouderaa , Maximilian L. Croci , Agrin Hilmkil , James Hensman

Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing…

Machine Learning · Computer Science 2026-01-27 Xi Zhang , Xiaolin Wu , Jiamang Wang , Weisi Lin

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for…

Machine Learning · Computer Science 2025-02-11 Wen-Pu Cai , Ming-Yang Li , Wu-Jun Li

Learning Optimal Lattice Vector Quantizers for End-to-end Neural Image Compression

It is customary to deploy uniform scalar quantization in the end-to-end optimized neural image compression methods, instead of more powerful vector quantization, due to the high complexity of the latter. Lattice vector quantization (LVQ),…

Image and Video Processing · Electrical Eng. & Systems 2024-11-26 Xi Zhang , Xiaolin Wu

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

In recent years, compression of large language models (LLMs) has emerged as an important problem to enable language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of…

Machine Learning · Computer Science 2024-10-04 Sean I. Young

LL-VQ-VAE: Learnable Lattice Vector-Quantization For Efficient Representations

In this paper we introduce learnable lattice vector quantization and demonstrate its effectiveness for learning discrete representations. Our method, termed LL-VQ-VAE, replaces the vector quantization layer in VQ-VAE with lattice-based…

Machine Learning · Computer Science 2023-10-17 Ahmed Khalil , Robert Piechocki , Raul Santos-Rodriguez

VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank…

Computation and Language · Computer Science 2026-03-18 Yixuan Wang , Qingyu Shi , Jiayu Zhou , Dianbo Liu , Ziwei He , Zhouhan Lin

Residual vector quantization for KV cache compression in large language model

KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high fidelity audio…

Machine Learning · Computer Science 2024-10-22 Ankur Kumar

MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Zhong Wang , Zukang Xu , Xing Hu , Dawei Yang

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements…

Machine Learning · Computer Science 2024-10-10 Ruihao Gong , Yang Yong , Shiqiao Gu , Yushi Huang , Chengtao Lv , Yunchen Zhang , Xianglong Liu , Dacheng Tao

Learning Low-Rank Representations for Model Compression

Vector Quantization (VQ) is an appealing model compression method to obtain a tiny model with less accuracy loss. While methods to obtain better codebooks and codes under fixed clustering dimensionality have been extensively studied,…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Zezhou Zhu , Yucong Zhou , Zhao Zhong

Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization,…

Computation and Language · Computer Science 2025-08-01 Jindong Li , Yali Fu , Jiahong Liu , Linxiao Cao , Wei Ji , Menglin Yang , Irwin King , Ming-Hsuan Yang

LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are…

Computation and Language · Computer Science 2025-05-07 Binrui Zeng , Bin Ji , Xiaodong Liu , Jie Yu , Shasha Li , Jun Ma , Xiaopeng Li , Shangwen Wang , Xinran Hong , Yongtao Tang

A Comprehensive Study on Quantization Techniques for Large Language Models

Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands…

Machine Learning · Computer Science 2024-11-06 Jiedong Lang , Zhehao Guo , Shuyu Huang

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and…

Computation and Language · Computer Science 2025-10-14 Haoqi Yang , Yao Yao , Zuchao Li , Baoyuan Qi , Guoming Liu , Hai Zhao

GPTVQ: The Blessing of Dimensionality for LLM Quantization

In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector…

Machine Learning · Computer Science 2025-06-04 Mart van Baalen , Andrey Kuzmin , Ivan Koryakovskiy , Markus Nagel , Peter Couperus , Cedric Bastoul , Eric Mahurin , Tijmen Blankevoort , Paul Whatmough

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue , Zhihang Yuan , Haojie Duanmu , Sifan Zhou , Jianlong Wu , Liqiang Nie

VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to…

Computation and Language · Computer Science 2025-05-16 Jintian Shao , Hongyi Huang , Jiayi Wu , YiMing Cheng , ZhiYu Wu , You Shan , MingKai Zheng