English
Related papers

Related papers: ZeroQuant-HERO: Hardware-Enhanced Robust Optimized…

200 papers

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an…

Computation and Language · Computer Science 2022-06-07 Zhewei Yao , Reza Yazdani Aminabadi , Minjia Zhang , Xiaoxia Wu , Conglong Li , Yuxiong He

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li , Tianchi Zhang , Ian En-Hsu Yen , Dongkuan Xu

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science 2025-09-30 Qitao Tan , Xiaoying Song , Jin Lu , Guoming Li , Jun Liu , Lingzi Hong , Caiwen Ding , Jundong Li , Xiaoming Zhai , Shaoyi Huang , Wei Niu , Geng Yuan

Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization.…

Computer Vision and Pattern Recognition · Computer Science 2020-03-29 Yaohui Cai , Zhewei Yao , Zhen Dong , Amir Gholami , Michael W. Mahoney , Kurt Keutzer

Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data…

Machine Learning · Computer Science 2024-11-19 Wenjin Guo , Donglai Liu , Weiying Xie , Yunsong Li , Xuefei Ning , Zihan Meng , Shulin Zeng , Jie Lei , Zhenman Fang , Yu Wang

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Tomer Gafni , Asaf Karnieli , Yair Hanani

Efficient inference is critical for deploying deep learning models on edge AI devices. Low-bit quantization (e.g., 3- and 4-bit) with fixed-point arithmetic improves efficiency, while low-power memory technologies like analog nonvolatile…

Machine Learning · Computer Science 2025-07-15 Anmol Biswas , Raghav Singhal , Sivakumar Elangovan , Shreyas Sabnis , Udayan Ganguly

With the recent demand of deploying neural network models on mobile and edge devices, it is desired to improve the model's generalizability on unseen testing data, as well as enhance the model's robustness under fixed-point quantization for…

Machine Learning · Computer Science 2021-11-26 Huanrui Yang , Xiaoxuan Yang , Neil Zhenqiang Gong , Yiran Chen

The 8 bits quantization has been widely applied to accelerate network inference in various deep learning applications. There are two kinds of quantization methods, training-based quantization and post-training quantization. Training-based…

Computer Vision and Pattern Recognition · Computer Science 2020-07-01 Di Wu , Qi Tang , Yongle Zhao , Ming Zhang , Ying Fu , Debing Zhang

Quantization is a key technique to reduce the resource requirement and improve the performance of neural network deployment. However, different hardware backends such as x86 CPU, NVIDIA GPU, ARM CPU, and accelerators may demand different…

Computer Vision and Pattern Recognition · Computer Science 2021-03-30 Ziheng Jiang , Animesh Jain , Andrew Liu , Josh Fromm , Chengqian Ma , Tianqi Chen , Luis Ceze

Neural network quantization is widely used to reduce model inference complexity in real-world deployments. However, traditional integer quantization suffers from accuracy degradation when adapting to various dynamic ranges. Recent research…

Performance · Computer Science 2023-10-30 Zhuoyi Zhang , Yunchen Zhang , Gonglei Shi , Yu Shen , Ruihao Gong , Xiaoxu Xia , Qi Zhang , Lewei Lu , Xianglong Liu

We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure,or they require an…

Computation and Language · Computer Science 2022-11-18 Yousef El-Kurdi , Jerry Quinn , Avirup Sil

Model quantization is a promising approach to compress deep neural networks and accelerate inference, making it possible to be deployed on mobile and edge devices. To retain the high performance of full-precision models, most existing…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Yuang Liu , Wei Zhang , Jun Wang

Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end…

Machine Learning · Computer Science 2024-08-01 Harshavardhan Adepu , Zhanpeng Zeng , Li Zhang , Vikas Singh

The increasing computational and memory demands in deep learning present significant challenges, especially in resource-constrained environments. We introduce a zero-order quantized optimization (ZOQO) method designed for training models…

Machine Learning · Computer Science 2025-01-14 Noga Bar , Raja Giryes

We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit…

Machine Learning · Computer Science 2019-11-26 Markus Nagel , Mart van Baalen , Tijmen Blankevoort , Max Welling

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same…

Computation and Language · Computer Science 2024-04-03 Guangxuan Xiao , Ji Lin , Mickael Seznec , Hao Wu , Julien Demouth , Song Han

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang , Aining Jia , Weifeng Bu , Yushu Cai , Kai Sheng , Hao Chen , Xin He

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly…

Machine Learning · Computer Science 2023-07-24 Xiaoxia Wu , Zhewei Yao , Yuxiong He

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2020-08-14 Kuan Wang , Zhijian Liu , Yujun Lin , Ji Lin , Song Han
‹ Prev 1 2 3 10 Next ›