English
Related papers

Related papers: EfficientQAT: Efficient Quantization-Aware Trainin…

200 papers

Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko , Riccardo Del Chiaro , Markus Nagel

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang , Anshumali Shrivastava

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their…

Machine Learning · Computer Science 2025-09-30 Qitao Tan , Xiaoying Song , Jin Lu , Guoming Li , Jun Liu , Lingzi Hong , Caiwen Ding , Jundong Li , Xiaoming Zhai , Shaoyi Huang , Wei Niu , Geng Yuan

Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Wenjin Ke , Zhe Li , Dong Li , Lu Tian , Emad Barsoum

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ)…

Machine Learning · Computer Science 2024-03-19 Wenqi Shao , Mengzhao Chen , Zhaoyang Zhang , Peng Xu , Lirui Zhao , Zhiqian Li , Kaipeng Zhang , Peng Gao , Yu Qiao , Ping Luo

Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective…

Artificial Intelligence · Computer Science 2024-03-06 Hanlin Tang , Yifu Sun , Decheng Wu , Kai Liu , Jianchen Zhu , Zhanhui Kang

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization…

Computation and Language · Computer Science 2023-05-30 Zechun Liu , Barlas Oguz , Changsheng Zhao , Ernie Chang , Pierre Stock , Yashar Mehdad , Yangyang Shi , Raghuraman Krishnamoorthi , Vikas Chandra

Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full…

Machine Learning · Computer Science 2024-11-19 Saleh Ashkboos , Bram Verhoef , Torsten Hoefler , Evangelos Eleftheriou , Martino Dazzi

Quantization-aware training (QAT) is a common paradigm for network quantization, in which the training phase incorporates the simulation of the low-precision computation to optimize the quantization parameters in alignment with the task…

Machine Learning · Computer Science 2024-12-23 Chengting Yu , Shu Yang , Fengzhao Zhang , Hanzhi Ma , Aili Wang , Er-Ping Li

Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank…

Machine Learning · Computer Science 2025-07-23 Hyesung Jeon , Yulhwa Kim , Jae-joon Kim

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior…

Machine Learning · Computer Science 2026-02-27 Aleksandr Dremov , David Grangier , Angelos Katharopoulos , Awni Hannun

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of…

Machine Learning · Computer Science 2025-07-24 Steven K. Esser , Jeffrey L. McKinstry , Deepika Bablani , Rathinakumar Appuswamy , Dharmendra S. Modha

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first…

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or…

Computation and Language · Computer Science 2026-05-12 Wenxiang Lin , Juntao Huang , Luhan Zhang , Laili Li , Xiang Bao , Mengyang Zhang , Bing Wang , Shaohuai Shi

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then…

Machine Learning · Computer Science 2025-04-22 Xuan Shen , Peiyan Dong , Lei Lu , Zhenglun Kong , Zhengang Li , Ming Lin , Chao Wu , Yanzhi Wang

Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.…

Machine Learning · Computer Science 2026-02-27 Wenzheng Zhang , Bingzheng Liu , Yang Hu , Xiaoying Bai , Wentao Zhang , Bin Cui

Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Shaibal Saha , Lanyu Xu

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining…

Machine Learning · Computer Science 2025-06-12 Andrei Panferov , Jiale Chen , Soroush Tabesh , Roberto L. Castro , Mahdi Nikdan , Dan Alistarh

In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than trillion in higher versions. This raises a…

Computation and Language · Computer Science 2026-01-06 Mahmoud Elgenedy
‹ Prev 1 2 3 10 Next ›