Related papers: Efficient Post-training Quantization with FP8 Form…

Exploring the Potential of Flexible 8-bit Format: Design and Algorithm

Neural network quantization is widely used to reduce model inference complexity in real-world deployments. However, traditional integer quantization suffers from accuracy degradation when adapting to various dynamic ranges. Recent research…

Performance · Computer Science 2023-10-30 Zhuoyi Zhang, Yunchen Zhang, Gonglei Shi, Yu Shen, Ruihao Gong, Xiaoxu Xia, Qi Zhang, Lewei Lu, Xianglong Liu

FP8-BERT: Post-Training Quantization for Transformer

Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

FP8 Formats for Deep Learning

FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two…

Machine Learning · Computer Science 2022-10-03 Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu

A Comprehensive Evaluation on Quantization Techniques for Large Language Models

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often…

Machine Learning · Computer Science 2026-01-30 Yutong Liu, Cairong Zhao, Guosheng Hu

Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown…

Machine Learning · Computer Science 2025-02-18 Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

FP8 versus INT8 for efficient deep learning inference

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with…

Machine Learning · Computer Science 2023-06-16 Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to…

Machine Learning · Computer Science 2026-05-27 Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

FPTQ: Fine-grained Post-Training Quantization for Large Language Models

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Being a prevalent compression technique, quantization has emerged as the mainstream practice to tackle this issue, which…

Computation and Language · Computer Science 2023-08-31 Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie

Optimizing Large Language Model Training Using FP4 Quantization

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8…

Machine Learning · Computer Science 2026-05-18 Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

FP8 Quantization: The Power of the Exponent

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale…

Machine Learning · Computer Science 2024-02-26 Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based…

Computation and Language · Computer Science 2024-04-30 Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly…

Machine Learning · Computer Science 2023-07-24 Xiaoxia Wu, Zhewei Yao, Yuxiong He

FP4 All the Way: Fully Quantized Training of LLMs

We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We…

Machine Learning · Computer Science 2025-08-12 Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Tomer Gafni, Asaf Karnieli, Yair Hanani

Evaluating Quantized Large Language Models

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the…

Computation and Language · Computer Science 2024-06-07 Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Post-training 4-bit quantization of convolution networks for rapid-deployment

Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of…

Computer Vision and Pattern Recognition · Computer Science 2019-05-30 Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry

Improving Quantization with Post-Training Model Expansion

The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically…

Machine Learning · Computer Science 2025-08-29 Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott

Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant

Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive…

Computation and Language · Computer Science 2025-06-05 Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon