Related papers: SQuAT: Sharpness- and Quantization-Aware Training …

Error-aware Quantization through Noise Tempering

Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with…

Machine Learning · Computer Science 2022-12-13 Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

Regularization-based Framework for Quantization-, Fault- and Variability-Aware Training

Efficient inference is critical for deploying deep learning models on edge AI devices. Low-bit quantization (e.g., 3- and 4-bit) with fixed-point arithmetic improves efficiency, while low-power memory technologies like analog nonvolatile…

Machine Learning · Computer Science 2025-07-15 Anmol Biswas, Raghav Singhal, Sivakumar Elangovan, Shreyas Sabnis, Udayan Ganguly

Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Despite the outstanding performance of transformers in both language and vision tasks, the expanding computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter drawbacks,…

Machine Learning · Computer Science 2024-10-15 Xijie Huang, Zhiqiang Shen, Pingcheng Dong, Kwang-Ting Cheng

Compute-Optimal Quantization-Aware Training

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior…

Machine Learning · Computer Science 2026-02-27 Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

GAQAT: gradient-adaptive quantization-aware training for domain generalization

Research on loss surface geometry, such as Sharpness-Aware Minimization (SAM), shows that flatter minima improve generalization. Recent studies further reveal that flatter minima can also reduce the domain generalization (DG) gap. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li, Zhi Wang, Wenwu Zhu

SQUAT: Stateful Quantization-Aware Training in Recurrent Spiking Neural Networks

Weight quantization is used to deploy high-performance deep learning models on resource-limited hardware, enabling the use of low-precision integers for storage and computation. Spiking neural networks (SNNs) share the goal of enhancing…

Neural and Evolutionary Computing · Computer Science 2024-05-01 Sreyes Venkatesh, Razvan Marinescu, Jason K. Eshraghian

A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major…

Machine Learning · Computer Science 2020-10-28 Jianfei Chen, Yu Gai, Zhewei Yao, Michael W. Mahoney, Joseph E. Gonzalez

Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart

Quantization-aware training (QAT) is a common paradigm for network quantization, in which the training phase incorporates the simulation of the low-precision computation to optimize the quantization parameters in alignment with the task…

Machine Learning · Computer Science 2024-12-23 Chengting Yu, Shu Yang, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Er-Ping Li

A Study of Quantisation-aware Training on Time Series Transformer Models for Resource-constrained FPGAs

This study explores the quantisation-aware training (QAT) on time series Transformer models. We propose a novel adaptive quantisation scheme that dynamically selects between symmetric and asymmetric schemes during the QAT phase. Our…

Machine Learning · Computer Science 2023-10-05 Tianheng Ling, Chao Qian, Lukas Einhaus, Gregor Schiele

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

Sharpness-aware Quantization for Deep Neural Networks

Network quantization is a dominant paradigm of model compression. However, the abrupt changes in quantized weights during training often lead to severe loss fluctuations and result in a sharp loss landscape, making the gradients unstable…

Computer Vision and Pattern Recognition · Computer Science 2023-03-22 Jing Liu, Jianfei Cai, Bohan Zhuang

Quantization Meets OOD: Generalizable Quantization-aware Training from a Flatness Perspective

Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li, Zhi Wang, Wenwu Zhu

FP8-BERT: Post-Training Quantization for Transformer

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

EfQAT: An Efficient Framework for Quantization-Aware Training

Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full…

Machine Learning · Computer Science 2024-11-19 Saleh Ashkboos, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, Martino Dazzi

WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points

Quantization-aware training (QAT) is widely adopted to quantize language models by training full-precision weights using gradients from the quantized model. The main bottleneck is its slow convergence and early performance plateau,…

Machine Learning · Computer Science 2026-05-20 Dongyue Li, Zechun Liu, Kai Yi, Zhenshuo Zhang, Changsheng Zhao, Raghuraman Krishnamoorthi, Harshit Khaitan, Hongyang R. Zhang, Steven Li

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique at decreasing the model size, memory access, and compute load of large models.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-26 David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He

Post-Training Sparsity-Aware Quantization

Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware…

Machine Learning · Computer Science 2021-10-29 Gil Shomron, Freddy Gabbay, Samer Kurzum, Uri Weiser

Squat: Quant Small Language Models on the Edge

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by the increasing concerns over cloud costs, privacy, and latency. Considering that full parameter…

Machine Learning · Computer Science 2025-07-03 Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study

Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training quantization (PTQ) usually comes with the cost of large…

Machine Learning · Computer Science 2026-01-22 Keyu Lv, Manyi Zhang, Xiaobo Xia, Jingchen Ni, Shannan Yan, Xianzhi Yu, Lu Hou, Chun Yuan, Haoli Bai

Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers

Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising method to lower the…

Computation and Language · Computer Science 2023-02-24 Minsoo Kim, Kyuhong Shim, Seongmin Park, Wonyong Sung, Jungwook Choi