Related papers: FBQuant: FeedBack Quantization for Large Language …

200 papers

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute…

Computation and Language · Computer Science 2024-10-07 Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on…

Machine Learning · Computer Science 2024-02-21 Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally…

Computation and Language · Computer Science 2025-08-12 Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models…

Machine Learning · Computer Science 2023-08-22 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack…

Machine Learning · Computer Science 2025-12-01 Dong Liu, Yanxuan Yu

Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often…

Machine Learning · Computer Science 2026-05-27 Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce…

Machine Learning · Computer Science 2025-10-22 Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan

Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit…

Machine Learning · Computer Science 2023-12-08 Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng

Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective…

Artificial Intelligence · Computer Science 2024-03-06 Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of…

Machine Learning · Computer Science 2026-05-19 Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression --…

Machine Learning · Computer Science 2026-04-09 Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when…

Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as…

Hardware Architecture · Computer Science 2025-10-16 Jude Haris, José Cano

The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a…

Machine Learning · Computer Science 2026-02-06 Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and…

Computation and Language · Computer Science 2025-10-14 Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science 2025-09-30 Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan