Related papers: FlatQuant: Flatness Matters for LLM Quantization

200 papers

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques,…

Machine Learning · Computer Science 2024-03-20 Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji

Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance…

Machine Learning · Computer Science 2025-06-06 Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same…

Computation and Language · Computer Science 2024-04-03 Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit…

Machine Learning · Computer Science 2023-12-08 Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose…

Machine Learning · Computer Science 2025-01-28 Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum

Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and…

Machine Learning · Computer Science 2025-11-25 Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen, Trung Le, Gustavo Carneiro, Jianfei Cai, Thanh-Toan Do

Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it…

Computation and Language · Computer Science 2024-06-28 Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when…

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to…

Machine Learning · Computer Science 2025-05-26 Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce…

Machine Learning · Computer Science 2025-10-22 Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often…

Machine Learning · Computer Science 2026-01-30 Yutong Liu, Cairong Zhao, Guosheng Hu

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical…

Computation and Language · Computer Science 2024-04-09 Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based…

Computation and Language · Computer Science 2024-04-30 Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He