Quantization without Tears

Computer Vision and Pattern Recognition 2025-07-09 v4

Abstract

Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive: they require extensive task-specific hyperparameters, and even a single misconfiguration can impair model performance, which limits their generality across models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it admits a closed-form solution, allowing us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. Our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, and provides new insights for the design of novel quantization paradigms. The code is publicly available at https://github.com/wujx2001/QwT
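
To make the idea of a linear compensation layer with a closed-form solution concrete, here is a minimal NumPy sketch. The abstract does not spell out the exact formulation, so this assumes ridge-regularized least squares fitted on a small calibration batch; the names `fit_linear_compensation`, `uniform_quantize`, and `ridge` are hypothetical and are not taken from the QwT repository.

```python
import numpy as np

def uniform_quantize(w, bits=3):
    """Symmetric uniform quantization of a weight matrix (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def fit_linear_compensation(x, y_fp, y_q, ridge=1e-6):
    """Closed-form fit of W, b so that y_q + x @ W + b approximates y_fp.

    Assumption: ridge-regularized least squares on the residual y_fp - y_q.
    """
    residual = y_fp - y_q                          # what quantization lost
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # extra column for the bias
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])
    params = np.linalg.solve(gram, x_aug.T @ residual)
    return params[:-1], params[-1]                 # W, b

# Toy "block": a linear map followed by ReLU, with random calibration inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
w_fp = rng.normal(size=(16, 8))
w_q = uniform_quantize(w_fp, bits=3)

y_fp = np.maximum(x @ w_fp, 0.0)   # full-precision block output
y_q = np.maximum(x @ w_q, 0.0)     # quantized block output

W, b = fit_linear_compensation(x, y_fp, y_q)
y_comp = y_q + x @ W + b           # quantized output plus linear correction

err_before = np.mean((y_fp - y_q) ** 2)
err_after = np.mean((y_fp - y_comp) ** 2)
print(f"MSE before: {err_before:.4f}, after: {err_after:.4f}")
```

Because the fit is a single linear solve with no gradient-based training, its cost is dominated by forming and solving a small linear system. This is consistent with the speed claim above, though the timing of this toy example says nothing about the paper's reported results.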

Cite

@article{arxiv.2411.13918,
  title  = {Quantization without Tears},
  author = {Minghao Fu and Hao Yu and Jie Shao and Junjie Zhou and Ke Zhu and Jianxin Wu},
  journal= {arXiv preprint arXiv:2411.13918},
  year   = {2025}
}

Comments

CVPR 2025. The code is publicly available at https://github.com/wujx2001/QwT