Related papers: Learning Accurate Integer Transformer Machine-Tran…

Towards Fully 8-bit Integer Inference for the Transformer Model

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain…

Computation and Language · Computer Science 2020-09-21 Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining…

Machine Learning · Computer Science 2019-06-10 Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, Vikram Saletore

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory…

Machine Learning · Computer Science 2022-11-11 Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

FP8 versus INT8 for efficient deep learning inference

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with…

Machine Learning · Computer Science 2023-06-16 Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

Training and inference of large language models using 8-bit floating point

FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced…

Machine Learning · Computer Science 2023-10-02 Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon

$\mu$nit Scaling: Simple and Scalable FP8 LLM Training

Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing…

Machine Learning · Computer Science 2025-06-06 Saaketh Narayan, Abhay Gupta, Mansheej Paul, Davis Blalock

Training Transformers with 4-bit Integers

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this…

Machine Learning · Computer Science 2023-06-26 Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu

FP8-BERT: Post-Training Quantization for Transformer

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

I-BERT: Integer-only BERT Quantization

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference…

Computation and Language · Computer Science 2022-05-02 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Efficient Inference For Neural Machine Translation

Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without…

Computation and Language · Computer Science 2020-10-08 Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, Ilya Chatsviorkin

Learning Deep Transformer Models for Machine Translation

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto…

Computation and Language · Computer Science 2019-06-06 Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal…

Machine Learning · Computer Science 2024-07-23 Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu

Very Deep Transformers for Neural Machine Translation

We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based…

Computation and Language · Computer Science 2020-10-16 Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao

Shallow-to-Deep Training for Neural Machine Translation

Deep encoders have been proven to be effective in improving neural machine translation (NMT) systems, but training an extremely deep encoder is time consuming. Moreover, why deep models help NMT is an open question. In this paper, we…

Computation and Language · Computer Science 2020-10-09 Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, Jingbo Zhu

Accurate INT8 Training Through Dynamic Block-Level Fallback

Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already…

Machine Learning · Computer Science 2025-06-10 Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen

Q8BERT: Quantized 8Bit BERT

Recently, pre-trained Transformer based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even…

Computation and Language · Computer Science 2021-12-20 Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Faster Inference of LLMs using FP8 on the Intel Gaudi

Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in…

Hardware Architecture · Computer Science 2025-03-18 Joonhyung Lee, Shmulik Markovich-Golan, Daniel Ohayon, Yair Hanani, Gunho Park, Byeongwook Kim, Asaf Karnieli, Uri Livne, Haihao Shen, Tai Huang, Se Jung Kwon, Dongsoo Lee

Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding

Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens inputted to an encoder…

Computation and Language · Computer Science 2019-11-15 Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, Lawrence Carin

Scaling Neural Machine Translation

Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training…

Computation and Language · Computer Science 2018-09-06 Myle Ott, Sergey Edunov, David Grangier, Michael Auli

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been…

Computation and Language · Computer Science 2025-10-20 Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang