Related papers: LCQ: Low-Rank Codebook based Quantization for Larg…

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization…

Machine Learning · Computer Science 2025-02-11 Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

In recent years, compression of large language models (LLMs) has emerged as an important problem to enable language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of…

Machine Learning · Computer Science 2024-10-04 Sean I. Young

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Channel-Wise Mixed-Precision Quantization for Large Language Models

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter…

Computation and Language · Computer Science 2025-02-05 Zihan Chen, Bike Xie, Jundong Li, Cong Shen

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model…

Machine Learning · Computer Science 2023-05-29 Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He

When Quantization Affects Confidence of Large Language Models?

Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference,…

Computation and Language · Computer Science 2024-05-02 Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Parameter quantization for Large Language Models (LLMs) has recently attracted increasing attention for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods…

Machine Learning · Computer Science 2024-06-04 Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Is Quantization a Deal-breaker? Empirical Insights from Large Code Models

The growing scale of large language models (LLMs) not only demands extensive computational resources but also raises environmental concerns due to their increasing carbon footprint. Model quantization emerges as an effective approach that…

Software Engineering · Computer Science 2025-07-15 Saima Afrin, Bowen Xu, Antonio Mastropaolo

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression…

Machine Learning · Computer Science 2026-05-18 Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements…

Machine Learning · Computer Science 2024-10-10 Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Chengtao Lv, Yunchen Zhang, Xianglong Liu, Dacheng Tao

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs

Weight quantization effectively reduces memory consumption and enables the deployment of Large Language Models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor…

Machine Learning · Computer Science 2026-02-03 Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

GWQ: Gradient-Aware Weight Quantization for Large Language Models

Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. Compressing LLMs to low bits can therefore enable deployment on…

Machine Learning · Computer Science 2025-05-30 Yihua Shao, Yan Gu, Siyu Chen, Haiyang Liu, Zixian Zhu, Zijian Ling, Minxi Yan, Ziyang Yan, Chenyu Zhang, Michele Magno, Haotong Qin, Yan Wang, Jingcai Guo, Ling Shao, Hao Tang

PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs…

Machine Learning · Computer Science 2025-11-25 Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, Kai Han

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Recent years have witnessed the rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one…

Machine Learning · Computer Science 2023-10-10 Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

A Comprehensive Study on Quantization Techniques for Large Language Models

Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands…

Machine Learning · Computer Science 2024-11-06 Jiedong Lang, Zhehao Guo, Shuyu Huang

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a…

Computation and Language · Computer Science 2025-02-24 Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the…

Machine Learning · Computer Science 2026-02-27 Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu