Related papers: MRQ: Support Multiple Quantization Schemes through …

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded…

Machine Learning · Computer Science 2020-12-15 Sung-En Chang , Yanyu Li , Mengshu Sun , Runbin Shi , Hayden K.-H. So , Xuehai Qian , Yanzhi Wang , Xue Lin

MWQ: Multiscale Wavelet Quantized Neural Networks

Model quantization can reduce the model size and computational latency; it has become an essential technique for the deployment of deep neural networks on resource-constrained hardware (e.g., mobile phones and embedded devices). The existing…

Computer Vision and Pattern Recognition · Computer Science 2021-03-10 Qigong Sun , Yan Ren , Licheng Jiao , Xiufang Li , Fanhua Shang , Fang Liu

Learnable Companding Quantization for Accurate Low-bit Neural Networks

Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed, and is thus useful for implementation in resource-constrained devices. However, it is still hard for extremely low-bit…

Computer Vision and Pattern Recognition · Computer Science 2021-11-03 Kohei Yamamoto

Hardware-Centric AutoML for Mixed-Precision Quantization

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2020-08-14 Kuan Wang , Zhijian Liu , Yujun Lin , Ji Lin , Song Han

A Closer Look at Hardware-Friendly Weight Quantization

Quantizing a Deep Neural Network (DNN) model to be used on a custom accelerator with efficient fixed-point hardware implementations, requires satisfying many stringent hardware-friendly quantization constraints to train the model. We…

Machine Learning · Computer Science 2022-10-10 Sungmin Bae , Piotr Zielinski , Satrajit Chatterjee

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ)…

Machine Learning · Computer Science 2026-05-21 Jinghe Zhang , Daliang Xu , Chenghua Wang , Weikai Xie , Tao Qi , Yun Ma , Mengwei Xu , Gang Huang

RepQ: Generalizing Quantization-Aware Training for Re-Parametrized Architectures

Existing neural networks are memory-consuming and computationally intensive, making deploying them challenging in resource-constrained environments. However, there are various methods to improve their efficiency. Two such methods are…

Machine Learning · Computer Science 2023-11-10 Anastasiia Prutianova , Alexey Zaytsev , Chung-Kuei Lee , Fengyu Sun , Ivan Koryakovskiy

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks

Hardware-friendly network quantization (e.g., binary/uniform quantization) can efficiently accelerate the inference and meanwhile reduce memory consumption of the deep neural networks, which is crucial for model deployment on…

Computer Vision and Pattern Recognition · Computer Science 2019-08-15 Ruihao Gong , Xianglong Liu , Shenghu Jiang , Tianxiang Li , Peng Hu , Jiazhen Lin , Fengwei Yu , Junjie Yan

HAQ: Hardware-Aware Automated Quantization with Mixed Precision

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Kuan Wang , Zhijian Liu , Yujun Lin , Ji Lin , Song Han

DBQ: A Differentiable Branch Quantizer for Lightweight Deep Neural Networks

Deep neural networks have achieved state-of-the-art performance on various computer vision tasks. However, their deployment on resource-constrained devices has been hindered due to their high computational and storage complexity. While…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Hassan Dbouk , Hetul Sanghvi , Mahesh Mehendale , Naresh Shanbhag

Weight Normalization based Quantization for Deep Neural Network Compression

With the development of deep neural networks, the size of network models becomes larger and larger. Model compression has become an urgent need for deploying these network models to mobile or embedded devices. Model quantization is a…

Machine Learning · Computer Science 2019-07-02 Wen-Pu Cai , Wu-Jun Li

Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Quantization is of significance for compressing the over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from performance drop due to the limited numerical representation…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Chen Tang , Yuan Meng , Jiacheng Jiang , Shuzhao Xie , Rongwei Lu , Xinzhu Ma , Zhi Wang , Wenwu Zhu

Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution

Model quantization is challenging due to many tedious hyper-parameters such as precision (bitwidth), dynamic range (minimum and maximum discrete values) and stepsize (interval between discrete values). Unlike prior arts that carefully tune…

Machine Learning · Computer Science 2021-07-08 Zhang Zhaoyang , Shao Wenqi , Gu Jinwei , Wang Xiaogang , Luo Ping

RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization

We introduce a Power-of-Two low-bit post-training quantization (PTQ) method for deep neural networks that meets hardware requirements and does not call for long-time retraining. Power-of-Two quantization can convert the multiplication…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Hongyi Yao , Pu Li , Jian Cao , Xiangcheng Liu , Chenying Xie , Bingzhang Wang

ZeroQ: A Novel Zero Shot Quantization Framework

Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization.…

Computer Vision and Pattern Recognition · Computer Science 2020-03-29 Yaohui Cai , Zhewei Yao , Zhen Dong , Amir Gholami , Michael W. Mahoney , Kurt Keutzer

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and…

Machine Learning · Computer Science 2022-01-26 Yuhang Li , Mingzhu Shen , Jian Ma , Yan Ren , Mingxin Zhao , Qi Zhang , Ruihao Gong , Fengwei Yu , Junjie Yan

MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework

With the tremendous success of deep learning, there exists an imminent need to deploy deep learning models onto edge devices. To tackle the limited computing and storage resources in edge devices, model compression techniques have been widely…

Machine Learning · Computer Science 2020-10-20 Sung-En Chang , Yanyu Li , Mengshu Sun , Weiwen Jiang , Runbin Shi , Xue Lin , Yanzhi Wang

ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs

The remarkable success of deep neural networks (DNNs) in various applications is accompanied by a significant increase in network parameters and arithmetic operations. Such increases in memory and computational demands make deep learning…

Machine Learning · Computer Science 2024-06-07 Daniel Becking , Maximilian Dreyer , Wojciech Samek , Karsten Müller , Sebastian Lapuschkin

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang , Anshumali Shrivastava

MSQ: Memory-Efficient Bit Sparsification Quantization

As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and…

Machine Learning · Computer Science 2025-07-31 Seokho Han , Seoyeon Yoon , Jinhee Kim , Dongwei Wang , Kang Eun Jeon , Huanrui Yang , Jong Hwan Ko