QTIP: Quantization with Trellises and Incoherence Processing

Albert Tseng; Qingyao Sun; David Hou; Christopher De Sa

Machine Learning 2025-06-19 v4

Abstract

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
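The contrast between an exponentially sized VQ codebook and a stateful trellis decoder can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one common reading of a "bitshift" trellis: the decoder slides an `L`-bit window over the bitstream, advances `k` bits per weight, and uses the window as an index into a table of `2**L` values. The function names, the parameters `L` and `k`, and the random placeholder codebook are hypothetical and do not come from the abstract.

```python
import random


def vq_codebook_entries(bits_per_weight: int, dim: int) -> int:
    """Entries in an unstructured VQ codebook: one per possible dim-weight code."""
    return 2 ** (bits_per_weight * dim)


def bitshift_decode(bits, L, k, codebook):
    """Decode one value per step from an L-bit window that advances k bits.

    Assumption (not stated in the abstract): consecutive windows overlap in
    L - k bits, so the table size 2**L is independent of both the sequence
    length and the bitrate k.
    """
    assert len(codebook) == 2 ** L, "codebook must have one entry per L-bit state"
    n_steps = (len(bits) - L) // k + 1
    decoded = []
    for t in range(n_steps):
        window = bits[t * k : t * k + L]
        state = int("".join(str(b) for b in window), 2)  # window as an integer state
        decoded.append(codebook[state])
    return decoded


# VQ at 2 bits per weight: codebook size grows exponentially with dimension.
for dim in (2, 4, 8, 16):
    print(dim, vq_codebook_entries(2, dim))

# Trellis decoding at k = 1 bit per step with a 16-entry placeholder codebook.
rng = random.Random(0)
placeholder_codebook = [rng.gauss(0.0, 1.0) for _ in range(2 ** 4)]
print(bitshift_decode([1, 0, 1, 1, 0, 0], L=4, k=1, codebook=placeholder_codebook))
```

In this sketch the trellis table stays at `2**L` entries however many weights are decoded, whereas the VQ codebook at 2 bits per weight already needs 65536 entries at dimension 8.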

Keywords

quantization, post-training quantization, trellis coded quantization

Cite

@article{arxiv.2406.11235,
  title   = {QTIP: Quantization with Trellises and Incoherence Processing},
  author  = {Albert Tseng and Qingyao Sun and David Hou and Christopher De Sa},
  journal = {arXiv preprint arXiv:2406.11235},
  year    = {2024}
}

Comments

NeurIPS 2024 Spotlight
