Related papers: Optimizing LLMs Using Quantization for Mobile Exec…

Evaluating Quantized Large Language Models

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the…

Computation and Language · Computer Science 2024-06-07 Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Can Post-Training Quantization Benefit from an Additional QLoRA Integration?

Large language models (LLMs) have transformed natural language processing but pose significant challenges for real-world deployment. These models necessitate considerable computing resources, which can be costly and frequently unavailable.…

Computation and Language · Computer Science 2025-02-17 Xiliang Zhu, Elena Khasanova, Cheng Chen

LLM Compression: How Far Can We Go in Balancing Size and Performance?

Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling…

Computation and Language · Computer Science 2025-08-18 Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar

GPTQT: Quantize Large Language Models Twice to Push the Efficiency

Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed…

Machine Learning · Computer Science 2024-07-04 Yipin Guo, Yilin Lang, Qinyuan Ren

The Uneven Impact of Post-Training Quantization in Machine Translation

Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training…

Computation and Language · Computer Science 2025-08-29 Benjamin Marie, Atsushi Fujita

Post Training Quantization of Large Language Models with Microscaling Formats

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of…

Machine Learning · Computer Science 2024-10-17 Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Wanzin Yazar, Ilya Soloveychik, Xin Wang

LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi

Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization…

Machine Learning · Computer Science 2025-04-04 Mahsa Ardakani, Jinendra Malekar, Ramtin Zand

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware,…

Machine Learning · Computer Science 2026-01-22 Uygar Kurt

MobileQuant: Mobile-friendly Quantization for On-device Language Models

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute…

Computation and Language · Computer Science 2024-10-07 Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression…

Machine Learning · Computer Science 2026-05-18 Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

Large language models (LLMs) have shown remarkable performance in various domains, but they are constrained by massive computational and storage costs. Quantization, an effective technique for compressing models to fit resource-limited…

Computation and Language · Computer Science 2026-04-14 Han Liu, Haotian Gao, Xiaotong Zhang, Changya Li, Feng Zhang, Wei Wang, Fenglong Ma, Hong Yu

Norm Tweaking: High-performance Low-bit Quantization of Large Language Models

As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving…

Machine Learning · Computer Science 2023-12-14 Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu

A Comprehensive Evaluation on Quantization Techniques for Large Language Models

For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often…

Machine Learning · Computer Science 2026-01-30 Yutong Liu, Cairong Zhao, Guosheng Hu

Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis

Post-training quantization (PTQ) has been extensively adopted for compressing large language models (LLMs) owing to its efficiency and low resource requirements. However, current research lacks an in-depth analysis of the superior…

Machine Learning · Computer Science 2025-05-22 Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie

Performance Trade-offs of Optimizing Small Language Models for E-Commerce

Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by…

Artificial Intelligence · Computer Science 2025-10-28 Josip Tomo Licardo, Nikola Tankovic

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer…

Machine Learning · Computer Science 2023-10-31 Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

On the Compressibility of Quantized Large Language Models

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory…

Machine Learning · Computer Science 2024-05-07 Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank…

Machine Learning · Computer Science 2025-07-23 Hyesung Jeon, Yulhwa Kim, Jae-joon Kim