Related papers: Efficient Post-training Quantization with FP8 Form…

200 papers

Neural network quantization is widely used to reduce model inference complexity in real-world deployments. However, traditional integer quantization suffers from accuracy degradation when adapting to various dynamic ranges. Recent research…

Performance · Computer Science 2023-10-30 Zhuoyi Zhang, Yunchen Zhang, Gonglei Shi, Yu Shen, Ruihao Gong, Xiaoxu Xia, Qi Zhang, Lewei Lu, Xianglong Liu

Transformer-based models, such as BERT, have been applied to a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in…

Artificial Intelligence · Computer Science 2023-12-13 Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two…

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often…

Machine Learning · Computer Science 2026-01-30 Yutong Liu, Cairong Zhao, Guosheng Hu

The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown…

Machine Learning · Computer Science 2025-02-18 Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with…

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to…

Machine Learning · Computer Science 2026-05-27 Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Being a prevalent compression technique, quantization has emerged as the mainstream practice to tackle this issue, which…

Computation and Language · Computer Science 2023-08-31 Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8…

Machine Learning · Computer Science 2026-05-18 Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale…

Machine Learning · Computer Science 2024-02-26 Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based…

Computation and Language · Computer Science 2024-04-30 Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly…

Machine Learning · Computer Science 2023-07-24 Xiaoxia Wu, Zhewei Yao, Yuxiong He

We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We…

Machine Learning · Computer Science 2025-08-12 Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Tomer Gafni, Asaf Karnieli, Yair Hanani

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the…

Computation and Language · Computer Science 2024-06-07 Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of…

Computer Vision and Pattern Recognition · Computer Science 2019-05-30 Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry

The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically…

Machine Learning · Computer Science 2025-08-29 Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott

Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive…

Computation and Language · Computer Science 2025-06-05 Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon