Related papers: Identifying Sensitive Weights via Post-quantizatio…

PD-Quant: Post-Training Quantization based on Prediction Difference Metric

Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can help reduce the size and computational cost of deep…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Jiawei Liu , Lin Niu , Zhihang Yuan , Dawei Yang , Xinggang Wang , Wenyu Liu

Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach

As Large Language Models (LLMs) become increasingly computationally complex, developing efficient deployment strategies, such as quantization, becomes crucial. State-of-the-art Post-training Quantization (PTQ) techniques often rely on…

Machine Learning · Computer Science 2025-01-17 Alireza Ghaffari , Sharareh Younesian , Boxing Chen , Vahid Partovi Nia , Masoud Asgharian

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan , ChengZhu Bao , Zhiteng Li , Tianao Zhang , Shaoqiu Zhang , Ruobing Xie , Samm Sun , Yulun Zhang

Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research about LLM's quantization has mainly explored the interplay between weights and…

Computation and Language · Computer Science 2025-05-16 Yifei Gao , Jie Ou , Lei Wang , Jun Cheng , Mengchu Zhou

CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, it is still a challenge to maintain the accuracy of LLM after…

Machine Learning · Computer Science 2024-10-11 Wenyuan Liu , Xindian Ma , Peng Zhang , Yan Wang

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization…

Machine Learning · Computer Science 2025-02-11 Jung Hyun Lee , Jeonghoon Kim , June Yong Yang , Se Jung Kwon , Eunho Yang , Kang Min Yoo , Dongsoo Lee

Achieving binary weight and activation for LLMs using Post-Training Quantization

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4…

Machine Learning · Computer Science 2025-07-01 Siqing Song , Chuang Wang , Ruiqi Wang , Yi Yang , Xu-Yao Zhang

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression…

Machine Learning · Computer Science 2026-05-18 Dung Anh Hoang , Cuong Pham , Cuong Nguyen , Trung le , Jianfei Cai , Thanh-Toan Do

Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Zekang Zheng , Haokun Li , Yaofo Chen , Mingkui Tan , Qing Du

Activation Sensitivity as a Unifying Principle for Post-Training Quantization

Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such…

Machine Learning · Computer Science 2026-01-21 Bruce Changlong Xu

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model…

Machine Learning · Computer Science 2023-05-29 Zhewei Yao , Xiaoxia Wu , Cheng Li , Stephen Youn , Yuxiong He

SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited…

Computation and Language · Computer Science 2026-04-14 Han Liu , Haotian Gao , Xiaotong Zhang , Changya Li , Feng Zhang , Wei Wang , Fenglong Ma , Hong Yu

PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mix-precision scheme by leveraging an…

Machine Learning · Computer Science 2025-08-07 Jiaqi Zhao , Miao Zhang , Ming Wang , Yuzhang Shang , Kaihao Zhang , Weili Guan , Yaowei Wang , Min Zhang

BoA: Attention-aware Post-training Quantization without Backpropagation

Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for small-scale networks, such as ResNet, rely on gradient-based optimization,…

Machine Learning · Computer Science 2025-06-09 Junhan Kim , Ho-young Kim , Eulrang Cho , Chungman Lee , Joonyoung Kim , Yongkweon Jeon

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke , Zhe Li , Dong Li , Lu Tian , Emad Barsoum

BAQ: Efficient Bit Allocation Quantization for Large Language Models

Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to…

Machine Learning · Computer Science 2025-06-09 Chao Zhang , Li Wang , Samson Lasaulce , Merouane Debbah

FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation

Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Zhuguanyu Wu , Shihe Wang , Jiayi Zhang , Jiaxin Chen , Yunhong Wang

DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which…

Machine Learning · Computer Science 2024-10-18 Yingsong Luo , Ling Chen

GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account…

Machine Learning · Computer Science 2025-09-23 Jinuk Kim , Marwa El Halabi , Wonpyo Park , Clemens JS Schaefer , Deokjae Lee , Yeonhong Park , Jae W. Lee , Hyun Oh Song

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from $\textit{incoherent}$…

Machine Learning · Computer Science 2024-01-17 Jerry Chee , Yaohui Cai , Volodymyr Kuleshov , Christopher De Sa