Related papers: Zero-Shot Sharpness-Aware Quantization for Pre-tra…

Zero-shot Adversarial Quantization

Model quantization is a promising approach to compress deep neural networks and accelerate inference, making it possible to deploy them on mobile and edge devices. To retain the high performance of full-precision models, most existing…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-31 · Yuang Liu, Wei Zhang, Jun Wang

Sharpness-Aware Data Generation for Zero-shot Quantization

Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing…

Machine Learning · Computer Science · 2025-10-09 · Dung Hoang-Anh, Cuong Pham Trung Le, Jianfei Cai, Thanh-Toan Do

Zero-shot Quantization: A Comprehensive Survey

Network quantization has proven to be a powerful approach to reduce the memory and computational demands of deep learning models for deployment on resource-constrained devices. However, traditional quantization methods often rely on access…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-15 · Minjun Kim, Jaehyeon Choi, Jongkeun Lee, Wonjin Cho, U Kang

Long-Range Zero-Shot Generative Deep Network Quantization

Quantization approximates a deep network model that uses floating-point numbers with one that uses low-bit-width numbers, in order to accelerate inference and reduce computation. Quantizing a model without access to the original data, zero-shot…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-18 · Yan Luo, Yangcheng Gao, Zhao Zhang, Haijun Zhang, Mingliang Xu, Meng Wang

ZeroQ: A Novel Zero Shot Quantization Framework

Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization.…

Computer Vision and Pattern Recognition · Computer Science · 2020-03-29 · Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-20 · Minjun Kim, Jongjin Kim, U Kang

Sharpness-aware Quantization for Deep Neural Networks

Network quantization is a dominant paradigm of model compression. However, the abrupt changes in quantized weights during training often lead to severe loss fluctuations and result in a sharp loss landscape, making the gradients unstable…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-22 · Jing Liu, Jianfei Cai, Bohan Zhuang

Task-Specific Zero-shot Quantization-Aware Training for Object Detection

Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-23 · Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science · 2024-09-04 · Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science · 2025-09-30 · Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

Towards Efficient Post-training Quantization of Pre-trained Language Models

Network quantization has gained increasing attention with the rapid growth of large pre-trained language models (PLMs). However, most existing quantization methods for PLMs follow quantization-aware training (QAT) that requires end-to-end…

Computation and Language · Computer Science · 2021-10-01 · Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, Michael R. Lyu

SQuAT: Sharpness- and Quantization-Aware Training for BERT

Quantization is an effective technique to reduce memory footprint, inference latency, and power consumption of deep learning models. However, existing quantization methods suffer from accuracy degradation compared to full-precision (FP)…

Machine Learning · Computer Science · 2022-10-14 · Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science · 2024-11-12 · Jahid Hasan

GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT

Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-16 · Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park

Sharpness-Aware Minimization Improves Language Model Generalization

The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along…

Computation and Language · Computer Science · 2022-03-17 · Dara Bahri, Hossein Mobahi, Yi Tay

Genie: Show Me the Data for Quantization

Zero-shot quantization is a promising approach for developing lightweight deep neural networks when data is inaccessible owing to various reasons, including cost and issues related to privacy. By exploiting the learned parameters (μ and…

Machine Learning · Computer Science · 2023-08-09 · Yongkweon Jeon, Chungman Lee, Ho-young Kim

Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?

Quantization is essential for deploying large audio language models (LALMs) efficiently in resource-constrained environments. However, its impact on complex tasks, such as zero-shot audio spoofing detection, remains underexplored. This…

Sound · Computer Science · 2025-06-10 · Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh

LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on…

Computation and Language · Computer Science · 2023-11-29 · Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

SiLQ: Simple Large Language Model Quantization-Aware Training

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of…

Machine Learning · Computer Science · 2025-07-24 · Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision,…

Computation and Language · Computer Science · 2025-12-17 · Shizhuo Mao, Song Chen, Yi Kang