Q8BERT: Quantized 8Bit BERT

Ofir Zafrir; Guy Boudoukh; Peter Izsak; Moshe Wasserblat

doi:10.1109/EMC2-NIPS53020.2019.00016

Computation and Language, Machine Learning; v2, 2021-12-20

Abstract

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. Using these large models in production environments, however, is a complex task that requires a large amount of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the resulting quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
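The abstract does not spell out the quantization scheme, so the following is only a minimal sketch of the general idea, assuming symmetric linear quantization with a single scale per tensor and a 32-bit floating-point baseline. The function names (`quantize_symmetric`, `dequantize`, `fake_quantize`) and the sample weights are hypothetical and are not taken from the paper. In quantization-aware training, a quantize-then-dequantize ("fake quantization") step of this kind is typically applied in the forward pass so that the model learns to tolerate the rounding error.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = qmax / max_abs if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x / scale for x in q]


def fake_quantize(values, num_bits=8):
    """Quantize then dequantize, simulating 8-bit rounding error in float math."""
    q, scale = quantize_symmetric(values, num_bits)
    return dequantize(q, scale)


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_symmetric(weights)
approx = fake_quantize(weights)

# Storage ratio, assuming the original weights are 32-bit floats.
compression = 32 / 8
```

Under these assumptions, each reconstructed value differs from the original by at most half a quantization step (`0.5 / scale`), and storing 8-bit integers in place of 32-bit floats gives the $4\times$ size reduction mentioned above.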

Keywords

pre-trained language model, BERT, quantization

Cite

@article{arxiv.1910.06188,
  title  = {Q8BERT: Quantized 8Bit BERT},
  author = {Ofir Zafrir and Guy Boudoukh and Peter Izsak and Moshe Wasserblat},
  journal= {arXiv preprint arXiv:1910.06188},
  year   = {2021}
}

Comments

5 Pages, Accepted at the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019
