Related papers: Accelerating PoT Quantization on Edge Devices

PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks…

Hardware Architecture · Computer Science 2026-05-08 Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano

n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization

Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware. However, PoT quantization triggers a severe accuracy drop because of its limited representation ability. Since…

Computer Vision and Pattern Recognition · Computer Science 2021-03-23 Yuiko Sakuma, Hiroshi Sumihiro, Jun Nishikawa, Toshiki Nakamura, Ryoji Ikegaya

Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation

Deep neural networks virtually dominate the domain of most modern vision systems, providing high performance at a cost of increased computational complexity. Since for those systems it is often required to operate both in real-time and with…

Computer Vision and Pattern Recognition · Computer Science 2023-11-14 Dominika Przewlocka-Rus, Tomasz Kryjak

Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks

Deploying Deep Neural Networks in low-power embedded devices for real time-constrained applications requires optimization of memory and computational complexity of the networks, usually by quantizing the weights. Most of the existing works…

Machine Learning · Computer Science 2022-03-11 Dominika Przewlocka-Rus, Syed Shakib Sarwar, H. Ekin Sumbul, Yuecheng Li, Barbara De Salvo

Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)

In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than a trillion in later versions. This raises a…

Computation and Language · Computer Science 2026-01-06 Mahmoud Elgenedy

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded…

Machine Learning · Computer Science 2020-12-15 Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, Xue Lin

PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration

Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads,…

Hardware Architecture · Computer Science 2024-04-01 Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah

Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks

The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing the parameters and operations to lower bit-precision offers substantial memory and energy savings for…

Machine Learning · Computer Science 2023-09-01 Clemens JS Schaefer, Siddharth Joshi, Shan Li, Raul Blazquez

PoTPTQ: A Two-step Power-of-Two Post-training for LLMs

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two…

Computation and Language · Computer Science 2025-07-17 Xinyu Wang, Vahid Partovi Nia, Peng Lu, Jerry Huang, Xiao-Wen Chang, Boxing Chen, Yufei Cui

Quality Scalable Quantization Methodology for Deep Learning on Edge

Deep Learning Architectures employ heavy computations, and the bulk of the computational energy is taken up by the convolution operations in Convolutional Neural Networks. The objective of our proposed work is to reduce the energy…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-17 Salman Abdul Khaliq, Rehan Hafiz

AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

It is usually infeasible to fit and train an entire large deep neural network (DNN) model using a single edge device due to the limited resources. To facilitate intelligent applications across edge devices, researchers have proposed…

Machine Learning · Computer Science 2023-11-13 Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Zhiguo Shi, Jiming Chen

Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference

Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and…

Hardware Architecture · Computer Science 2024-03-28 Akshat Ramachandran, Zishen Wan, Geonhwa Jeong, John Gustafson, Tushar Krishna

Progressive Element-wise Gradient Estimation for Neural Network Quantization

Neural network quantization aims to reduce the bit-widths of weights and activations, making it a critical technique for deploying deep neural networks on resource-constrained hardware. Most Quantization-Aware Training (QAT) methods rely on…

Machine Learning · Computer Science 2025-09-03 Kaiqi Zhao

PIPE : Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions

Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists in converting floating point operations into a…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Edouard Yvinec, Arnaud Dapogny, Kevin Bailly

Dataflow-based Joint Quantization of Weights and Activations for Deep Neural Networks

This paper addresses a challenging problem - how to reduce energy consumption without incurring performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate the computation and storage burdens,…

Machine Learning · Computer Science 2019-01-09 Xue Geng, Jie Fu, Bin Zhao, Jie Lin, Mohamed M. Sabry Aly, Christopher Pal, Vijay Chandrasekhar

QADAM: Quantization-Aware DNN Accelerator Modeling for Pareto-Optimality

As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators, varied bit precision or quantization levels, there is a need for design space exploration…

Hardware Architecture · Computer Science 2022-05-27 Ahmet Inci, Siri Garudanagiri Virupaksha, Aman Jain, Venkata Vivek Thallam, Ruizhou Ding, Diana Marculescu

Low- and Mixed-Precision Inference Accelerators

With the surging popularity of edge computing, the need to efficiently perform neural network inference on battery-constrained IoT devices has greatly increased. While algorithmic developments enable neural networks to solve increasingly…

Hardware Architecture · Computer Science 2022-06-27 Maarten Molendijk, Floran de Putter, Henk Corporaal

Low-bit Shift Network for End-to-End Spoken Language Understanding

Deep neural networks (DNN) have achieved impressive success in multiple domains. Over the years, the accuracy of these models has increased with the proliferation of deeper and more complex architectures. Thus, state-of-the-art solutions…

Sound · Computer Science 2022-07-18 Anderson R. Avila, Khalil Bibi, Rui Heng Yang, Xinlin Li, Chao Xing, Xiao Chen

DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator

The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; whereas, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy…

Signal Processing · Electrical Eng. & Systems 2020-06-29 Nandan Kumar Jha, Shreyas Ravishankar, Sparsh Mittal, Arvind Kaushik, Dipan Mandal, Mahesh Chandra

FATE: Fast and Accurate Timing Error Prediction Framework for Low Power DNN Accelerator Design

Deep neural networks (DNN) are increasingly being accelerated on application-specific hardware such as the Google TPU designed especially for deep learning. Timing speculation is a promising approach to further increase the energy…

Machine Learning · Computer Science 2018-07-03 Jeff Zhang, Siddharth Garg