Related papers: Binary Quadratic Quantization: Beyond First-Order …

Simultaneous Compression and Quantization: A Joint Approach for Efficient Unsupervised Hashing

For unsupervised data-dependent hashing, the two most important requirements are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. A well-established hashing approach is Iterative…

Computer Vision and Pattern Recognition · Computer Science 2019-11-14 Tuan Hoang, Thanh-Toan Do, Huu Le, Dang-Khoa Le-Tan, Ngai-Man Cheung

Pruning Ternary Quantization

Inference time, model size, and accuracy are three key factors in deep model compression. Most of the existing work addresses these three key factors separately as it is difficult to optimize them all at the same time. For example, low-bit…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Dan Liu, Xi Chen, Jie Fu, Chen Ma, Xue Liu

Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall

Binary quantization (BQ) compresses high-dimensional embeddings into one or two bits per coordinate, enabling nearest neighbor search at extreme speed. Yet a striking puzzle persists: BQ achieves competitive recall on contrastive embeddings…

Machine Learning · Computer Science 2026-05-19 Wenxuan Xiao

Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid development of production-grade diffusion models that consistently increase in…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Vage Egiazarian, Denis Kuznedelev, Anton Voronov, Ruslan Svirschevski, Michael Goin, Daniil Pavlov, Dan Alistarh, Dmitry Baranchuk

BAQ: Efficient Bit Allocation Quantization for Large Language Models

Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to…

Machine Learning · Computer Science 2025-06-09 Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah

Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer

Quantizing weights and activations of deep neural networks is essential for deploying them in resource-constrained devices, or cloud platforms for at-scale services. While binarization is a special case of quantization, this extreme case…

Computer Vision and Pattern Recognition · Computer Science 2021-04-02 Phuoc Pham, Jacob Abraham, Jaeyong Chung

BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs

The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well.…

Machine Learning · Computer Science 2020-09-01 Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee

Addition is almost all you need: Compressing large language models with double binary factorization

Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage…

Machine Learning · Computer Science 2026-03-03 Vladimír Boža, Vladimír Macko

Image and Video Tokenization with Binary Spherical Quantization

We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1)…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

Neural Network Compression using Binarization and Few Full-Precision Weights

Quantization and pruning are two effective Deep Neural Networks model compression methods. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique combining quantization with pruning. APB enhances the…

Computer Vision and Pattern Recognition · Computer Science 2023-09-18 Franco Maria Nardini, Cosimo Rulli, Salvatore Trani, Rossano Venturini

BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization

Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, it lacks a systematic method to determine the…

Machine Learning · Computer Science 2021-02-23 Huanrui Yang, Lin Duan, Yiran Chen, Hai Li

UWC: Unit-wise Calibration Towards Rapid Network Compression

This paper introduces a post-training quantization (PTQ) method achieving highly efficient Convolutional Neural Network (CNN) quantization with high performance. Previous PTQ methods usually reduce compression error via performing…

Computer Vision and Pattern Recognition · Computer Science 2022-01-19 Chen Lin, Zheyang Li, Bo Peng, Haoji Hu, Wenming Tan, Ye Ren, Shiliang Pu

Embedding Compression with Isotropic Iterative Quantization

Continuous representation of words is a standard component in deep learning-based NLP models. However, representing a large vocabulary requires significant memory, which can cause problems, particularly on resource-constrained platforms.…

Computation and Language · Computer Science 2020-01-24 Siyu Liao, Jie Chen, Yanzhi Wang, Qinru Qiu, Bo Yuan

PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mix-precision scheme by leveraging an…

Machine Learning · Computer Science 2025-08-07 Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang

SBVR: Summation of BitVector Representation for Efficient LLM Quantization

With the advent of large language models (LLMs), numerous Post-Training Quantization (PTQ) strategies have been proposed to alleviate deployment barriers created by their enormous parameter counts. Quantization achieves compression by…

Machine Learning · Computer Science 2025-09-24 Wonjun Bang, Jongseok Park, Hongseung Yu, Kyungmin Bin, Kyunghan Lee

Post-Training Quantization for Video Matting

Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model…

Computer Vision and Pattern Recognition · Computer Science 2025-06-13 Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin, Kai Zhang

Least squares binary quantization of neural networks

Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full precision and quantized models is the…

Machine Learning · Computer Science 2020-06-16 Hadi Pouransari, Zhucheng Tu, Oncel Tuzel

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding…

Machine Learning · Computer Science 2024-08-01 Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin

LO-BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only…

Machine Learning · Computer Science 2026-02-17 Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization

We introduce a Power-of-Two low-bit post-training quantization (PTQ) method for deep neural network that meets hardware requirements and does not call for long-time retraining. Power-of-Two quantization can convert the multiplication…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Hongyi Yao, Pu Li, Jian Cao, Xiangcheng Liu, Chenying Xie, Bingzhang Wang