Related papers: I-BERT: Integer-only BERT Quantization

FP8-BERT: Post-Training Quantization for Transformer

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science · 2023-12-13 · Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-08 · Zhikai Li, Qingyi Gu

MKQ-BERT: Quantized BERT with 4-bits Weights and Activations

Recently, pre-trained Transformer based language models, such as BERT, have shown great superiority over the traditional methods in many Natural Language Processing (NLP) tasks. However, the computational cost for deploying these models is…

Machine Learning · Computer Science · 2022-03-28 · Hanlin Tang, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang

Q8BERT: Quantized 8Bit BERT

Recently, pre-trained Transformer based language models such as BERT and GPT, have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large amount of parameters. The emergence of even…

Computation and Language · Computer Science · 2021-12-20 · Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Optimizing Inference Performance of Transformers on CPUs

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous…

Computation and Language · Computer Science · 2021-02-23 · Dave Dice, Alex Kogan

Integer-Only Neural Network Quantization Scheme Based on Shift-Batch-Normalization

Neural networks are very popular in many areas, but great computing complexity makes it hard to run neural networks on devices with limited resources. To address this problem, quantization methods are used to reduce model size and…

Machine Learning · Computer Science · 2021-06-02 · Qingyu Guo, Yuan Wang, Xiaoxin Cui

Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge…

Hardware Architecture · Computer Science · 2021-03-05 · Zejian Liu, Gang Li, Jian Cheng

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and…

Computation and Language · Computer Science · 2023-06-01 · Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be…

Machine Learning · Computer Science · 2017-12-19 · Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko

schuBERT: Optimizing Elements of BERT

Transformers (Vaswani et al., 2017) have gradually become a key component for many state-of-the-art natural language representation models. A recent Transformer-based model, BERT (Devlin et al., 2018), achieved state-of-the-art…

Computation and Language · Computer Science · 2020-05-15 · Ashish Khetan, Zohar Karnin

Integer-only Zero-shot Quantization for Efficient Speech Recognition

End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-02-01 · Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

Towards Fully 8-bit Integer Inference for the Transformer Model

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain…

Computation and Language · Computer Science · 2020-09-21 · Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and…

Machine Learning · Computer Science · 2020-09-09 · Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma

InTreeger: An End-to-End Framework for Integer-Only Decision Tree Inference

Integer quantization has emerged as a critical technique to facilitate deployment on resource-constrained devices. Although they do reduce the complexity of the learning models, their inference performance is often prone to…

Machine Learning · Computer Science · 2025-05-22 · Duncan Bart, Bruno Endres Forlin, Ana-Lucia Varbanescu, Marco Ottavi, Kuan-Hsun Chen

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have…

Computation and Language · Computer Science · 2024-12-20 · Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-15 · Jordan Sassoon, Michal Szczepanski, Martyna Poreba

I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths

Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing in recent years, brought to the forefront by pre-trained bi-directional Transformer models. Its effectiveness is…

Machine Learning · Computer Science · 2020-06-23 · Hyoungwook Nam, Seung Byum Seo, Vikram Sharma Mailthody, Noor Michael, Lan Li

TernaryBERT: Distillation-aware Ultra-low Bit BERT

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to…

Computation and Language · Computer Science · 2020-10-13 · Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu

DPBERT: Efficient Inference for BERT based on Dynamic Planning

Large-scale pre-trained language models such as BERT have contributed significantly to the development of NLP. However, those models require large computational resources, making it difficult to be applied to mobile devices where computing…

Computation and Language · Computer Science · 2023-08-02 · Weixin Wu, Hankz Hankui Zhuo

Quantized Transformer Language Model Implementations on Edge Devices

Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications, wherein these models are initially pre-trained with a large…

Computation and Language · Computer Science · 2023-10-09 · Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi