Related papers: Layer-Wise Quantization: A Pragmatic and Effective…

200 papers

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant…

Artificial Intelligence · Computer Science 2025-11-13 Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu

Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize…

Computer Vision and Pattern Recognition · Computer Science 2026-01-22 Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work,…

Machine Learning · Computer Science 2024-03-15 Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers…

Artificial Intelligence · Computer Science 2026-03-27 Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao

In this paper, we propose Mix-QViT, an explainability-driven MPQ framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Navin Ranjan, Andreas Savakis

Deploying Large Language Models (LLMs) on edge devices enhances privacy but faces performance hurdles due to limited resources. We introduce a systematic methodology to evaluate on-device LLMs, balancing capability, efficiency, and resource…

Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the…

Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands…

Machine Learning · Computer Science 2024-11-06 Jiedong Lang, Zhehao Guo, Shuyu Huang

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. Compressing LLMs to low bits can enable deployment on…

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization…

Computation and Language · Computer Science 2023-05-30 Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find…

Machine Learning · Computer Science 2025-09-25 Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia

Quantization is an essential step in the efficient deployment of deep learning models and as such is an increasingly popular research topic. An important practical aspect that is not addressed in the current literature is how to analyze and…

Machine Learning · Computer Science 2020-12-16 Shachar Gluska, Mark Grobman

Despite the superior performance, Large Language Models (LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs…

Computation and Language · Computer Science 2023-07-27 Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen

Large Language Models (LLMs) have shown an impressive capability in code generation. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code.…

Software Engineering · Computer Science 2026-01-28 Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, Gabriele Bavota

Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across…

Machine Learning · Computer Science 2025-03-04 Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization…

Machine Learning · Computer Science 2025-04-04 Mahsa Ardakani, Jinendra Malekar, Ramtin Zand

Vision transformers (ViTs) have demonstrated remarkable performance across various visual tasks. However, ViT models suffer from substantial computational and memory requirements, making it challenging to deploy them on resource-constrained…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Navin Ranjan, Andreas Savakis