Related papers: Dual Grained Quantization: Efficient Fine-Grained …

FPTQ: Fine-grained Post-Training Quantization for Large Language Models

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Being a prevalent compression technique, quantization has emerged as the mainstream practice to tackle this issue, which…

Computation and Language · Computer Science 2023-08-31 Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie

Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Tomer Gafni, Asaf Karnieli, Yair Hanani

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in…

Computation and Language · Computer Science 2024-07-19 Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or…

Computation and Language · Computer Science 2026-05-12 Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi

GWQ: Gradient-Aware Weight Quantization for Large Language Models

Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. So, compressing LLMs to low bits can enable to deploy on…

Machine Learning · Computer Science 2025-05-30 Yihua Shao, Yan Gu, Siyu Chen, Haiyang Liu, Zixian Zhu, Zijian Ling, Minxi Yan, Ziyang Yan, Chenyu Zhang, Michele Magno, Haotong Qin, Yan Wang, Jingcai Guo, Ling Shao, Hao Tang

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models…

Machine Learning · Computer Science 2023-08-22 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques…

Computation and Language · Computer Science 2025-05-02 Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations…

Hardware Architecture · Computer Science 2025-04-22 Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory…

Machine Learning · Computer Science 2024-11-12 Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, Guohao Dai

COMET: Towards Partical W4A4KV4 LLMs Serving

Quantization is a widely-used compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or…

Hardware Architecture · Computer Science 2024-10-17 Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding…

Machine Learning · Computer Science 2024-08-01 Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Achieving binary weight and activation for LLMs using Post-Training Quantization

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4…

Machine Learning · Computer Science 2025-07-01 Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xu-Yao Zhang

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He

LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization…

Machine Learning · Computer Science 2026-04-22 Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang

SqueezeLLM: Dense-and-Sparse Quantization

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This…

Computation and Language · Computer Science 2024-06-06 Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs

Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory…

Machine Learning · Computer Science 2025-04-29 Xilong Xie, Liang Wang, Limin Xiao, Meng Han, Lin Sun, Shuai Zheng, Xiangrong Xu

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava