Related papers: Learning Accurate Integer Transformer Machine-Tran…

200 papers

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. On the other hand, previous systems still rely on 32-bit floating point for certain…

Computation and Language · Computer Science 2020-09-21 Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining…

Machine Learning · Computer Science 2019-06-10 Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, Vikram Saletore

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory…

Machine Learning · Computer Science 2022-11-11 Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with…

FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced…

Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing…

Machine Learning · Computer Science 2025-06-06 Saaketh Narayan, Abhay Gupta, Mansheej Paul, Davis Blalock

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this…

Machine Learning · Computer Science 2023-06-26 Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference…

Computation and Language · Computer Science 2022-05-02 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without…

Computation and Language · Computer Science 2020-10-08 Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, Ilya Chatsviorkin

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto…

Computation and Language · Computer Science 2019-06-06 Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao

Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal…

Machine Learning · Computer Science 2024-07-23 Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu

We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based…

Computation and Language · Computer Science 2020-10-16 Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao

Deep encoders have been proven to be effective in improving neural machine translation (NMT) systems, but training an extremely deep encoder is time consuming. Moreover, why deep models help NMT is an open question. In this paper, we…

Computation and Language · Computer Science 2020-10-09 Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, Jingbo Zhu

Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already…

Machine Learning · Computer Science 2025-06-10 Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even…

Computation and Language · Computer Science 2021-12-20 Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in…

Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks. The Transformer, for instance, is an illustrative example that generates abstract representations of tokens inputted to an encoder…

Computation and Language · Computer Science 2019-11-15 Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, Lawrence Carin

Sequence-to-sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training…

Computation and Language · Computer Science 2018-09-06 Myle Ott, Sergey Edunov, David Grangier, Michael Auli

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been…

Computation and Language · Computer Science 2025-10-20 Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang