Related papers: Accelerating Neural Network Inference by Overflow …

Quantizing Convolutional Neural Networks for Low-Power High-Throughput Inference Engines

Deep learning as a means to inferencing has proliferated thanks to its versatility and ability to approach or exceed human-level accuracy. These computational models have seemingly insatiable appetites for computational resources not only…

Machine Learning · Computer Science 2018-05-22 Sean O. Settle , Manasa Bollavaram , Paolo D'Alberto , Elliott Delaye , Oscar Fernandez , Nicholas Fraser , Aaron Ng , Ashish Sirasao , Michael Wu

A2Q+: Improving Accumulator-Aware Weight Quantization

Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that also reducing the precision of the accumulator can further improve hardware…

Machine Learning · Computer Science 2024-01-22 Ian Colbert , Alessandro Pappalardo , Jakoba Petri-Koenig , Yaman Umuroglu

Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance

We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during…

Machine Learning · Computer Science 2023-02-01 Ian Colbert , Alessandro Pappalardo , Jakoba Petri-Koenig

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization…

Machine Learning · Computer Science 2020-04-22 Hao Wu , Patrick Judd , Xiaojie Zhang , Mikhail Isaev , Paulius Micikevicius

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be…

Machine Learning · Computer Science 2017-12-19 Benoit Jacob , Skirmantas Kligys , Bo Chen , Menglong Zhu , Matthew Tang , Andrew Howard , Hartwig Adam , Dmitry Kalenichenko

A Survey of Quantization Methods for Efficient Neural Network Inference

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related…

Computer Vision and Pattern Recognition · Computer Science 2021-06-23 Amir Gholami , Sehoon Kim , Zhen Dong , Zhewei Yao , Michael W. Mahoney , Kurt Keutzer

A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance

We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique…

Machine Learning · Computer Science 2023-08-28 Ian Colbert , Alessandro Pappalardo , Jakoba Petri-Koenig

Mixed-Precision Inference Quantization: Radically Towards Faster inference speed, Lower Storage requirement, and Lower Loss

Based on the model's resilience to computational noise, model quantization is important for compressing models and improving computing speed. Existing quantization techniques rely heavily on experience and "fine-tuning" skills. In the…

Machine Learning · Computer Science 2022-07-22 Daning Cheng , Wenguang Chen

Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms

Convolutional Neural Networks (CNNs) have proven to be a powerful state-of-the-art method for image classification tasks. One drawback however is the high computational complexity and high memory consumption of CNNs which makes them…

Computer Vision and Pattern Recognition · Computer Science 2021-02-04 Rishabh Goyal , Joaquin Vanschoren , Victor van Acht , Stephan Nijssen

Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for…

Machine Learning · Computer Science 2024-01-12 Deepika Bablani , Jeffrey L. Mckinstry , Steven K. Esser , Rathinakumar Appuswamy , Dharmendra S. Modha

Dataflow-based Joint Quantization of Weights and Activations for Deep Neural Networks

This paper addresses a challenging problem - how to reduce energy consumption without incurring performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate the computation and storage burdens,…

Machine Learning · Computer Science 2019-01-09 Xue Geng , Jie Fu , Bin Zhao , Jie Lin , Mohamed M. Sabry Aly , Christopher Pal , Vijay Chandrasekhar

On the Acceleration of Deep Neural Network Inference using Quantized Compressed Sensing

Accelerating deep neural network (DNN) inference on resource-limited devices is one of the most important barriers to ensuring a wider and more inclusive adoption. To alleviate this, DNN binary quantization for faster convolution and memory…

Machine Learning · Computer Science 2021-08-24 Meshia Cédric Oveneke

Alternating Multi-bit Quantization for Recurrent Neural Networks

Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For applications on the server with large scale concurrent…

Machine Learning · Computer Science 2018-02-02 Chen Xu , Jianqiang Yao , Zhouchen Lin , Wenwu Ou , Yuanbin Cao , Zhirong Wang , Hongbin Zha

Quantized Neural Network Inference with Precision Batching

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a…

Machine Learning · Computer Science 2020-03-03 Maximilian Lam , Zachary Yedidia , Colby Banbury , Vijay Janapa Reddi

Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited.…

Machine Learning · Computer Science 2023-07-28 Or Sharir , Anima Anandkumar

Effective Quantization for Diffusion Models on CPUs

Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, the substantial need for computational resources continues to present a noteworthy challenge, contributing to time-consuming processes.…

Computer Vision and Pattern Recognition · Computer Science 2023-11-30 Hanwen Chang , Haihao Shen , Yiyang Cai , Xinyu Ye , Zhenzhong Xu , Wenhua Cheng , Kaokao Lv , Weiwei Zhang , Yintong Lu , Heng Guo

Quantizing deep convolutional networks for efficient inference: A whitepaper

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision…

Machine Learning · Computer Science 2018-06-22 Raghuraman Krishnamoorthi

On the efficient representation and execution of deep acoustic models

In this paper we present a simple and computationally efficient quantization scheme that enables us to reduce the resolution of the parameters of a neural network from 32-bit floating point values to 8-bit integer values. The proposed…

Machine Learning · Computer Science 2016-12-20 Raziel Alvarez , Rohit Prabhavalkar , Anton Bakhtin

Towards Efficient Verification of Quantized Neural Networks

Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying…

Machine Learning · Computer Science 2023-12-29 Pei Huang , Haoze Wu , Yuting Yang , Ieva Daukantas , Min Wu , Yedi Zhang , Clark Barrett

QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference

As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-01 Xiangchen Li , Saeid Ghafouri , Bo Ji , Hans Vandierendonck , Deepu John , Dimitrios S. Nikolopoulos