Related papers: Layer-Wise Quantization: A Pragmatic and Effective…

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science · 2024-06-07 · Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant…

Artificial Intelligence · Computer Science · 2025-11-13 · Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu

Towards Understanding Best Practices for Quantization of Vision-Language Models

Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-22 · Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work,…

Machine Learning · Computer Science · 2024-03-15 · Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

SliderQuant: Accurate Post-Training Quantization for LLMs

In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers…

Artificial Intelligence · Computer Science · 2026-03-27 · Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao

Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity

In this paper, we propose Mix-QViT, an explainability-driven MPQ framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-14 · Navin Ranjan, Andreas Savakis

A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources

Deploying Large Language Models (LLMs) on edge devices enhances privacy but faces performance hurdles due to limited resources. We introduce a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource…

Machine Learning · Computer Science · 2026-03-17 · Qingyu Song, Rui Liu, Wei Lin, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Yining Jiang, Mochun Long, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan, Qiao Xiang, Hong Xu

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the…

Machine Learning · Computer Science · 2026-05-15 · Cristian Hinostroza, Rodrigo Toro Icarte, Christ Devia, Andres Carvallo De Ferari, Eugenio Herrera-Berg, Denis Parra, Jorge F Silva

A Comprehensive Study on Quantization Techniques for Large Language Models

Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands…

Machine Learning · Computer Science · 2024-11-06 · Jiedong Lang, Zhehao Guo, Shuyu Huang

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science · 2024-09-04 · Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-28 · Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

GWQ: Gradient-Aware Weight Quantization for Large Language Models

Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. Compressing LLMs to low bit-widths can therefore enable deployment on…

Machine Learning · Computer Science · 2025-05-30 · Yihua Shao, Yan Gu, Siyu Chen, Haiyang Liu, Zixian Zhu, Zijian Ling, Minxi Yan, Ziyang Yan, Chenyu Zhang, Michele Magno, Haotong Qin, Yan Wang, Jingcai Guo, Ling Shao, Hao Tang

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization…

Computation and Language · Computer Science · 2023-05-30 · Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Why Do Some Inputs Break Low-Bit LLM Quantization?

Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find…

Machine Learning · Computer Science · 2025-09-25 · Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia

Exploring Neural Networks Quantization via Layer-Wise Quantization Analysis

Quantization is an essential step in the efficient deployment of deep learning models and as such is an increasingly popular research topic. An important practical aspect that is not addressed in the current literature is how to analyze and…

Machine Learning · Computer Science · 2020-12-16 · Shachar Gluska, Mark Grobman

Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study

Despite the superior performance, Large Language Models (LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs…

Computation and Language · Computer Science · 2023-07-27 · Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen

Evaluating the Impact of Post-Training Quantization on Large Language Models for Code Generation

Large Language Models (LLMs) have shown an impressive capability in code generation. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code.…

Software Engineering · Computer Science · 2026-01-28 · Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, Gabriele Bavota

RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across…

Machine Learning · Computer Science · 2025-03-04 · Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi

Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization…

Machine Learning · Computer Science · 2025-04-04 · Mahsa Ardakani, Jinendra Malekar, Ramtin Zand

LRP-QViT: Mixed-Precision Vision Transformer Quantization via Layer-wise Relevance Propagation

Vision transformers (ViTs) have demonstrated remarkable performance across various visual tasks. However, ViT models suffer from substantial computational and memory requirements, making it challenging to deploy them on resource-constrained…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-23 · Navin Ranjan, Andreas Savakis