Related papers: Automated Backend-Aware Post-Training Quantization

HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the…

Hardware Architecture · Computer Science 2025-11-18 Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

HAQ: Hardware-Aware Automated Quantization with Mixed Precision

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han

HERO: Hessian-Enhanced Robust Optimization for Unifying and Improving Generalization and Quantization Performance

With the recent demand of deploying neural network models on mobile and edge devices, it is desired to improve the model's generalizability on unseen testing data, as well as enhance the model's robustness under fixed-point quantization for…

Machine Learning · Computer Science 2021-11-26 Huanrui Yang, Xiaoxuan Yang, Neil Zhenqiang Gong, Yiran Chen

Tango: rethinking quantization for graph neural network training on GPUs

Graph Neural Networks (GNNs) are becoming increasingly popular due to their superior performance in critical graph-related tasks. While quantization is widely used to accelerate GNN computation, quantized training faces unprecedented…

Machine Learning · Computer Science 2023-09-04 Shiyang Chen, Da Zheng, Caiwen Ding, Chengying Huan, Yuede Ji, Hang Liu

HPTQ: Hardware-Friendly Post Training Quantization

Neural network quantization enables the deployment of models on edge devices. An essential requirement for their hardware efficiency is that the quantizers are hardware-friendly: uniform, symmetric, and with power-of-two thresholds. To the…

Computer Vision and Pattern Recognition · Computer Science 2021-11-17 Hai Victor Habi, Reuven Peretz, Elad Cohen, Lior Dikstein, Oranit Dror, Idit Diamant, Roy H. Jennings, Arnon Netzer

HERO: Hardware-Efficient RL-based Optimization Framework for NeRF Quantization

Neural Radiance Field (NeRF) has emerged as a promising 3D reconstruction method, delivering high-quality results for AR/VR applications. While quantization methods and hardware accelerators have been proposed to enhance NeRF's…

Hardware Architecture · Computer Science 2025-10-13 Yipu Zhang, Chaofang Ma, Jinming Ge, Lin Jiang, Jiang Xu, Wei Zhang

Hardware-Centric AutoML for Mixed-Precision Quantization

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2020-08-14 Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han

Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Zekang Zheng, Haokun Li, Yaofo Chen, Mingkui Tan, Qing Du

HarmoQ: Harmonized Post-Training Quantization for High-Fidelity Image

Post-training quantization offers an efficient pathway to deploy super-resolution models, yet existing methods treat weight and activation quantization independently, missing their critical interplay. Through controlled experiments on…

Image and Video Processing · Electrical Engineering and Systems Science 2025-11-12 Hongjun Wang, Jiyuan Chen, Xuan Song, Yinqiang Zheng

A Practical Mixed Precision Algorithm for Post-Training Quantization

Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets…

Machine Learning · Computer Science 2023-02-13 Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial…

Machine Learning · Computer Science 2023-10-30 Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He

Post-training 4-bit quantization of convolution networks for rapid-deployment

Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of…

Computer Vision and Pattern Recognition · Computer Science 2019-05-30 Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Low-bit quantization emerges as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization leverages a mixture of bit-widths to unleash the accuracy and efficiency…

Machine Learning · Computer Science 2024-05-24 Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu

HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs

Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning…

Machine Learning · Computer Science 2025-11-06 Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

HAO: Hardware-aware neural Architecture Optimization for Efficient Inference

Automatic algorithm-hardware co-design for DNN has shown great success in improving the performance of DNNs on FPGAs. However, this process remains challenging due to the intractable search space of neural network architectures and hardware…

Computer Vision and Pattern Recognition · Computer Science 2021-04-27 Zhen Dong, Yizhao Gao, Qijing Huang, John Wawrzynek, Hayden K. H. So, Kurt Keutzer

A White Paper on Neural Network Quantization

While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge…

Machine Learning · Computer Science 2021-06-16 Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort

Hardware-friendly Deep Learning by Network Quantization and Binarization

Quantization is emerging as an efficient approach to promote hardware-friendly deep learning and run deep neural networks on resource-limited hardware. However, it still causes a significant decrease to the network in accuracy. We summarize…

Machine Learning · Computer Science 2021-12-03 Haotong Qin

Degree-Quant: Quantization-Aware Training for Graph Neural Networks

Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more…

Machine Learning · Computer Science 2021-03-16 Shyam A. Tailor, Javier Fernandez-Marques, Nicholas D. Lane

High-Accuracy Low-Precision Training

Low-precision computation is often used to lower the time and energy cost of machine learning, and recently hardware accelerators have been developed to support it. Still, it has been used primarily for inference - not training. Previous…

Machine Learning · Computer Science 2018-03-12 Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, Christopher Ré

EfQAT: An Efficient Framework for Quantization-Aware Training

Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full…

Machine Learning · Computer Science 2024-11-19 Saleh Ashkboos, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, Martino Dazzi