Related papers: FlatQuant: Flatness Matters for LLM Quantization

AffineQuant: Affine Transformation Quantization for Large Language Models

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques,…

Machine Learning · Computer Science 2024-03-20 Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji

FPTQuant: Function-Preserving Transforms for LLM Quantization

Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance…

Machine Learning · Computer Science 2025-06-06 Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same…

Computation and Language · Computer Science 2024-04-03 Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit…

Machine Learning · Computer Science 2023-12-08 Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose…

Machine Learning · Computer Science 2025-01-28 Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models

Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and…

Machine Learning · Computer Science 2025-11-25 Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen, Trung Le, Gustavo Carneiro, Jianfei Cai, Thanh-Toan Do

OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it…

Computation and Language · Computer Science 2024-06-28 Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao

SpinQuant: LLM quantization with learned rotations

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when…

Machine Learning · Computer Science 2025-02-21 Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava

FBQuant: FeedBack Quantization for Large Language Models

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to…

Machine Learning · Computer Science 2025-05-26 Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce…

Machine Learning · Computer Science 2025-10-22 Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan

A Comprehensive Evaluation on Quantization Techniques for Large Language Models

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often…

Machine Learning · Computer Science 2026-01-30 Yutong Liu, Cairong Zhao, Guosheng Hu

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical…

Computation and Language · Computer Science 2024-04-09 Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based…

Computation and Language · Computer Science 2024-04-30 Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He