Related papers: I-BERT: Integer-only BERT Quantization

200 papers

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on…

Computer Vision and Pattern Recognition · Computer Science 2023-08-08 Zhikai Li, Qingyi Gu

Recently, pre-trained Transformer based language models, such as BERT, have shown great superiority over the traditional methods in many Natural Language Processing (NLP) tasks. However, the computational cost for deploying these models is…

Machine Learning · Computer Science 2022-03-28 Hanlin Tang, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang

Recently, pre-trained Transformer based language models such as BERT and GPT, have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large amount of parameters. The emergence of even…

Computation and Language · Computer Science 2021-12-20 Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous…

Computation and Language · Computer Science 2021-02-23 Dave Dice, Alex Kogan

Neural networks are very popular in many areas, but great computing complexity makes it hard to run neural networks on devices with limited resources. To address this problem, quantization methods are used to reduce model size and…

Machine Learning · Computer Science 2021-06-02 Qingyu Guo, Yuan Wang, Xiaoxin Cui

BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge…

Hardware Architecture · Computer Science 2021-03-05 Zejian Liu, Gang Li, Jian Cheng

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and…

Computation and Language · Computer Science 2023-06-01 Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be…

Machine Learning · Computer Science 2017-12-19 Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko

Transformers (Vaswani et al., 2017) have gradually become a key component for many state-of-the-art natural language representation models. A recent Transformer-based model, BERT (Devlin et al., 2018), achieved state-of-the-art…

Computation and Language · Computer Science 2020-05-15 Ashish Khetan, Zohar Karnin

End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models perform poorly on edge hardware due to large memory and computation requirements. While quantizing…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-01 Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain…

Computation and Language · Computer Science 2020-09-21 Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and…

Integer quantization has emerged as a critical technique to facilitate deployment on resource-constrained devices. Although they do reduce the complexity of the learning models, their inference performance is often prone to…

Machine Learning · Computer Science 2025-05-22 Duncan Bart, Bruno Endres Forlin, Ana-Lucia Varbanescu, Marco Ottavi, Kuan-Hsun Chen

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have…

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers…

Computer Vision and Pattern Recognition · Computer Science 2025-09-15 Jordan Sassoon, Michal Szczepanski, Martyna Poreba

Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing in recent years, brought to the forefront by pre-trained bi-directional Transformer models. Its effectiveness is…

Machine Learning · Computer Science 2020-06-23 Hyoungwook Nam, Seung Byum Seo, Vikram Sharma Mailthody, Noor Michael, Lan Li

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to…

Computation and Language · Computer Science 2020-10-13 Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu

Large-scale pre-trained language models such as BERT have contributed significantly to the development of NLP. However, those models require large computational resources, making it difficult to be applied to mobile devices where computing…

Computation and Language · Computer Science 2023-08-02 Weixin Wu, Hankz Hankui Zhuo

Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications, wherein these models are initially pre-trained with a large…

Computation and Language · Computer Science 2023-10-09 Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi