Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

Artificial Intelligence 2023-10-10 v1

Abstract

Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational capability. There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization. Fine-grained quantization has smaller quantization loss and consequently achieves superior performance. However, when applied to weight-activation quantization, it disrupts continuous integer matrix multiplication, leading to inefficient inference. In this paper, we introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization scheme for LLMs that maintains superior performance while ensuring fast inference speed. DGQ dequantizes the fine-grained INT4 weights into a coarse-grained INT8 representation and performs matrix multiplication using INT8 kernels. In addition, we develop a two-phase grid search algorithm to simplify the determination of the fine-grained and coarse-grained quantization scales. We also devise a percentile clipping schema for smoothing the activation outliers without the need for complex optimization techniques. Experimental results demonstrate that DGQ consistently outperforms prior methods across various LLM architectures and a wide range of tasks. Remarkably, with our efficient CUTLASS kernel implementation, we achieve a 1.12× memory reduction and 3.24× speed gains compared to the A16W4 implementation. These advancements enable efficient deployment of A8W4 LLMs for real-world applications.
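The dequantization step described in the abstract (group-wise INT4 weights expanded to a channel-wise INT8 representation, followed by an integer matrix multiplication) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the group size, the integer range of 1 to 15 for the group multipliers, the max-based scale selection (used here in place of the two-phase grid search), and the per-tensor activation scale are all assumptions made for illustration.

```python
import numpy as np

def dual_grained_quantize(w, group_size=8):
    """Hypothetical helper: quantize float weights (out_ch, in_ch) to INT4
    codes with an integer per-group multiplier and a float per-channel scale.
    Assumes in_ch is divisible by group_size."""
    out_ch, in_ch = w.shape
    groups = w.reshape(out_ch, in_ch // group_size, group_size)
    # Fine-grained float scale per group, mapping the group onto [-7, 7].
    g_scale = np.maximum(np.abs(groups).max(axis=2, keepdims=True) / 7.0, 1e-8)
    # Coarse-grained float scale per output channel. The integer multiplier is
    # capped at 15 so that |int4| * multiplier <= 8 * 15 = 120 fits in INT8.
    c_scale = g_scale.max(axis=1, keepdims=True) / 15.0
    g_int = np.clip(np.round(g_scale / c_scale), 1, 15).astype(np.int8)
    # Quantize against the effective scale that dequantization will see.
    eff_scale = g_int * c_scale
    q4 = np.clip(np.round(groups / eff_scale), -8, 7).astype(np.int8)
    return q4, g_int, c_scale.reshape(out_ch)

def dequant_to_int8(q4, g_int):
    """Expand INT4 codes to INT8 using integer arithmetic only."""
    w8 = q4.astype(np.int16) * g_int.astype(np.int16)
    return w8.astype(np.int8).reshape(q4.shape[0], -1)

def int8_matmul(x, w8, c_scale):
    """Quantize activations per tensor (an assumption), accumulate in INT32,
    then apply the float scales once per output channel."""
    x_scale = np.abs(x).max() / 127.0
    x8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    acc = x8.astype(np.int32) @ w8.astype(np.int32).T
    return acc * x_scale * c_scale[None, :]

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64))
x = rng.standard_normal((4, 64))
q4, g_int, c_scale = dual_grained_quantize(w)
w8 = dequant_to_int8(q4, g_int)
y = int8_matmul(x, w8, c_scale)
rel_err = np.linalg.norm(y - x @ w.T) / np.linalg.norm(x @ w.T)
print(f"relative error vs. float matmul: {rel_err:.3f}")
```

In this sketch the group-wise information is folded into the weights before the multiplication, so the inner product runs over the full input dimension in integers and the floating-point scales are applied only once per output channel. A plain group-wise scheme would instead need a float rescale after every group, which is the interruption of integer matrix multiplication that the abstract refers to.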

Cite

@article{arxiv.2310.04836,
  title  = {Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM},
  author = {Luoming Zhang and Wen Fei and Weijia Wu and Yefei He and Zhenyu Lou and Hong Zhou},
  journal= {arXiv preprint arXiv:2310.04836},
  year   = {2023}
}

Comments

15 pages, 2 figures
