Related papers: SDQ: Sparse Decomposed Quantization for LLM Infere…

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops…

Computation and Language · Computer Science 2023-06-06 Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

SPQ: An Ensemble Technique for Large Language Model Compression

This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear…

Computation and Language · Computer Science 2026-02-23 Jiamin Yao, Eren Gultepe

SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques are difficult to balance accuracy and efficiency due to the…

Computation and Language · Computer Science 2025-12-08 Ruixuan Huang, Hao Zeng, Hantao Huang, Jinyuan Shi, Minghui Yu, Ian En-Hsu Yen, Shuai Wang

SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of…

Machine Learning · Computer Science 2025-10-07 Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a…

Computation and Language · Computer Science 2025-02-24 Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

Enabling Dynamic Sparsity in Quantized LLM Inference

Deploying large language models (LLMs) on end-user devices is gaining importance due to benefits in responsiveness, privacy, and operational cost. Yet the limited memory and compute capability of mobile and desktop GPUs make efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Rongxiang Wang, Kangyuan Shu, Felix Xiaozhu Lin

Dynamic Stashing Quantization for Efficient Transformer Training

Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of computations and memory accesses required for LLM training makes them…

Machine Learning · Computer Science 2023-03-10 Guo Yang, Daniel Lo, Robert Mullins, Yiren Zhao

LLM Compression: How Far Can We Go in Balancing Size and Performance?

Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling…

Computation and Language · Computer Science 2025-08-18 Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar

From Quarter to All: Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing

Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter sizes. While quantization reduces model size, it often leads to performance degradation compared to…

Hardware Architecture · Computer Science 2025-10-22 Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin

QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

Large Language Models (LLMs) have been emerging as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and capabilities in generating high-quality responses to the given inputs.…

Neural and Evolutionary Computing · Computer Science 2026-04-22 Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative…

Computation and Language · Computer Science 2023-07-13 James O'Neill, Sourav Dutta

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context…

Machine Learning · Computer Science 2024-11-13 Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

LCQ: Low-Rank Codebook based Quantization for Large Language Models

Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for…

Machine Learning · Computer Science 2025-02-11 Wen-Pu Cai, Ming-Yang Li, Wu-Jun Li

SiLQ: Simple Large Language Model Quantization-Aware Training

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of…

Machine Learning · Computer Science 2025-07-24 Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited…

Computation and Language · Computer Science 2026-04-14 Han Liu, Haotian Gao, Xiaotong Zhang, Changya Li, Feng Zhang, Wei Wang, Fenglong Ma, Hong Yu

QSpec: Speculative Decoding with Complementary Quantization Schemes

Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers from substantial…

Machine Learning · Computer Science 2025-10-03 Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

Squat: Quant Small Language Models on the Edge

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by the increasing concerns over cloud costs, privacy, and latency. Considering that full parameter…

Machine Learning · Computer Science 2025-07-03 Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao

SqueezeLLM: Dense-and-Sparse Quantization

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This…

Computation and Language · Computer Science 2024-06-06 Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer