Related papers: FBQuant: FeedBack Quantization for Large Language …

MobileQuant: Mobile-friendly Quantization for On-device Language Models

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute…

Computation and Language · Computer Science 2024-10-07 Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

FlatQuant: Flatness Matters for LLM Quantization

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally…

Computation and Language · Computer Science 2025-08-12 Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models…

Machine Learning · Computer Science 2023-08-22 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference

As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack…

Machine Learning · Computer Science 2025-12-01 Dong Liu, Yanxuan Yu

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often…

Machine Learning · Computer Science 2026-05-27 Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan

FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce…

Machine Learning · Computer Science 2025-10-22 Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan

SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM

Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit…

Machine Learning · Computer Science 2023-12-08 Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective…

Artificial Intelligence · Computer Science 2024-03-06 Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression --…

Machine Learning · Computer Science 2026-04-09 Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

SpinQuant: LLM quantization with learned rotations

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when…

Machine Learning · Computer Science 2025-02-21 Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs

Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as…

Hardware Architecture · Computer Science 2025-10-16 Jude Haris, José Cano

LittleBit: Ultra Low-Bit Quantization via Latent Factorization

The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a…

Machine Learning · Computer Science 2026-02-06 Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and…

Computation and Language · Computer Science 2025-10-14 Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science 2025-09-30 Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan