Related papers: QuAILoRA: Quantization-Aware Initialization for Lo…

LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on…

Computation and Language · Computer Science 2023-11-29 Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Recent years have witnessed the rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one…

Machine Learning · Computer Science 2023-10-10 Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization

Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks, particularly in scenarios with limited computational resources. However, applying LoRA techniques to…

Machine Learning · Computer Science 2025-08-15 Yanxia Deng, Aozhong Zhang, Selcuk Gurses, Naigang Wang, Zi Yang, Penghang Yin

Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance

Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing…

Machine Learning · Computer Science 2025-07-23 Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li

ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA),…

Machine Learning · Computer Science 2024-03-12 Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank…

Machine Learning · Computer Science 2025-07-23 Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods…

Machine Learning · Computer Science 2024-06-24 Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, so fine-tuning is required for various downstream tasks.…

Machine Learning · Computer Science 2025-02-19 Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout…

Machine Learning · Computer Science 2025-10-14 Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail…

Machine Learning · Computer Science 2024-05-28 Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno

Low-Rank Quantization-Aware Training for LLMs

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and…

Machine Learning · Computer Science 2024-09-04 Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel

Can Post-Training Quantization Benefit from an Additional QLoRA Integration?

Large language models (LLMs) have transformed natural language processing but pose significant challenges for real-world deployment. These models necessitate considerable computing resources, which can be costly and frequently unavailable.…

Computation and Language · Computer Science 2025-02-17 Xiliang Zhu, Elena Khasanova, Cheng Chen

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient…

Computation and Language · Computer Science 2024-08-28 Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

QLoRA: Efficient Finetuning of Quantized LLMs

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a…

Machine Learning · Computer Science 2023-05-24 Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized Large Language Models. First, we develop an extremely memory-efficient fine-tuning (EMEF) method for quantized…

Computation and Language · Computer Science 2023-06-16 Yuji Chai, John Gkountouras, Glenn G. Ko, David Brooks, Gu-Yeon Wei

Quantization-Robust LLM Unlearning via Low-Rank Adaptation

Large Language Model (LLM) unlearning aims to remove targeted knowledge from a trained model, but practical deployments often require post-training quantization (PTQ) for efficient inference. However, aggressive low-bit PTQ can mask…

Machine Learning · Computer Science 2026-04-08 João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however,…

Computation and Language · Computer Science 2026-03-19 Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by…

Machine Learning · Computer Science 2024-07-17 Shaowen Wang, Linxi Yu, Jian Li

FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs

Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory…

Machine Learning · Computer Science 2025-04-29 Xilong Xie, Liang Wang, Limin Xiao, Meng Han, Lin Sun, Shuai Zheng, Xiangrong Xu