Related papers: Dynamic Stashing Quantization for Efficient Transf…

SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision,…

Computation and Language · Computer Science 2025-12-17 Shizhuo Mao, Song Chen, Yi Kang

Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent…

Machine Learning · Computer Science 2026-04-16 Jaemin Kim, Sungkyun Kim, Junyeol Lee, Jiwon Seo

SDQ: Sparse Decomposed Quantization for LLM Inference

Recently, large language models (LLMs) have shown surprising performance in task-specific workloads as well as general tasks with the given prompts. However, to achieve unprecedented performance, recent LLMs use billions to trillions of…

Machine Learning · Computer Science 2024-06-21 Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna

DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which…

Machine Learning · Computer Science 2024-10-18 Yingsong Luo, Ling Chen

QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

Large Language Models (LLMs) have been emerging as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and capabilities in generating high-quality responses to the given inputs.…

Neural and Evolutionary Computing · Computer Science 2026-04-22 Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

MSQ: Memory-Efficient Bit Sparsification Quantization

As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and…

Machine Learning · Computer Science 2025-07-31 Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko

DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Parameter quantization for Large Language Models (LLMs) has recently attracted increasing attention for reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods…

Machine Learning · Computer Science 2024-06-04 Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for…

Machine Learning · Computer Science 2025-02-11 Wen-Pu Cai, Ming-Yang Li, Wu-Jun Li

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks

Hardware-friendly network quantization (e.g., binary/uniform quantization) can efficiently accelerate the inference and meanwhile reduce memory consumption of the deep neural networks, which is crucial for model deployment on…

Computer Vision and Pattern Recognition · Computer Science 2019-08-15 Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, Junjie Yan

LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are…

Computation and Language · Computer Science 2025-05-07 Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong, Yongtao Tang

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression…

Machine Learning · Computer Science 2026-05-18 Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

DLLMQuant: Quantizing Diffusion-based Large Language Models

Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used…

Computation and Language · Computer Science 2025-08-27 Chen Xu, Dawei Yang

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies.…

Computation and Language · Computer Science 2026-03-17 Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun

SiLQ: Simple Large Language Model Quantization-Aware Training

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of…

Machine Learning · Computer Science 2025-07-24 Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization…

Machine Learning · Computer Science 2025-02-11 Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

Efficient Quantization Strategies for Latent Diffusion Models

Latent Diffusion Models (LDMs) capture the dynamic evolution of latent variables over time, blending patterns and multimodality in a generative system. Despite the proficiency of LDM in various applications, such as text-to-image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-12 Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, Hongbo Zhang

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

Squat: Quant Small Language Models on the Edge

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by the increasing concerns over cloud costs, privacy, and latency. Considering that full parameter…

Machine Learning · Computer Science 2025-07-03 Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao