Related papers: MobileQuant: Mobile-friendly Quantization for On-d…

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources

Deploying Large Language Models (LLMs) on edge devices enhances privacy but faces performance hurdles due to limited resources. We introduce a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource…

Machine Learning · Computer Science 2026-03-17 Qingyu Song, Rui Liu, Wei Lin, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Yining Jiang, Mochun Long, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan, Qiao Xiang, Hong Xu

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

SLMQuant: Benchmarking Small Language Model Quantization for Practical Deployment

Despite the growing interest in Small Language Models (SLMs) as resource-efficient alternatives to Large Language Models (LLMs), their deployment on edge devices remains challenging due to unresolved efficiency gaps in model compression.…

Machine Learning · Computer Science 2025-11-18 Jiacheng Wang, Yejun Zeng, Jinyang Guo, Yuqing Ma, Aishan Liu, Xianglong Liu

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

On the Compressibility of Quantized Large Language Models

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory…

Machine Learning · Computer Science 2024-05-07 Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

Squat: Quant Small Language Models on the Edge

A growing trend has emerged in designing high-quality Small Language Models (SLMs) with a few million parameters. This trend is driven by the increasing concerns over cloud costs, privacy, and latency. Considering that full parameter…

Machine Learning · Computer Science 2025-07-03 Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same…

Computation and Language · Computer Science 2024-04-03 Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the…

Computation and Language · Computer Science 2026-04-28 Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

FBQuant: FeedBack Quantization for Large Language Models

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to…

Machine Learning · Computer Science 2025-05-26 Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints…

Computation and Language · Computer Science 2024-06-18 Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi

Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization…

Machine Learning · Computer Science 2025-04-04 Mahsa Ardakani, Jinendra Malekar, Ramtin Zand

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science 2025-09-30 Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression…

Machine Learning · Computer Science 2026-05-18 Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant…

Artificial Intelligence · Computer Science 2025-11-13 Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu

On-Device Language Models: A Comprehensive Review

The advent of large language models (LLMs) revolutionized natural language processing applications, and running LLMs on edge devices has become increasingly attractive for reasons including reduced latency, data localization, and…

Computation and Language · Computer Science 2024-09-17 Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling

GPTQT: Quantize Large Language Models Twice to Push the Efficiency

Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed…

Machine Learning · Computer Science 2024-07-04 Yipin Guo, Yilin Lang, Qinyuan Ren