Related papers: Dual Grained Quantization: Efficient Fine-Grained …

200 papers

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Being a prevalent compression technique, quantization has emerged as the mainstream practice to tackle this issue, which…

Computation and Language · Computer Science 2023-08-31 Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie

Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Tomer Gafni, Asaf Karnieli, Yair Hanani

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in…

Computation and Language · Computer Science 2024-07-19 Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory…

Machine Learning · Computer Science 2026-02-09 Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or…

Computation and Language · Computer Science 2026-05-12 Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi

Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. So, compressing LLMs to low bits can enable to deploy on…

Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models…

Machine Learning · Computer Science 2023-08-22 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques…

Computation and Language · Computer Science 2025-05-02 Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations…

Hardware Architecture · Computer Science 2025-04-22 Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory…

Machine Learning · Computer Science 2024-11-12 Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, Guohao Dai

Quantization is a widely-used compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or…

Hardware Architecture · Computer Science 2024-10-17 Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang

Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding…

Machine Learning · Computer Science 2024-08-01 Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4…

Machine Learning · Computer Science 2025-07-01 Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xu-Yao Zhang

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization…

Machine Learning · Computer Science 2026-04-22 Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This…

Computation and Language · Computer Science 2024-06-06 Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory…

Machine Learning · Computer Science 2025-04-29 Xilong Xie, Liang Wang, Limin Xiao, Meng Han, Lin Sun, Shuai Zheng, Xiangrong Xu

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava