Related papers: Matryoshka Quantization

MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization

Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions, by slicing the most significant bits (MSB) at inference time. This enables a single…

Machine Learning · Computer Science 2026-02-04 Maximilian Kleinegger , Elvir Crnčević , Dan Alistarh

Mixed-Precision Inference Quantization: Radically Towards Faster inference speed, Lower Storage requirement, and Lower Loss

Based on the model's resilience to computational noise, model quantization is important for compressing models and improving computing speed. Existing quantization techniques rely heavily on experience and "fine-tuning" skills. In the…

Machine Learning · Computer Science 2022-07-22 Daning Cheng , Wenguang Chen

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across…

Machine Learning · Computer Science 2026-04-10 Minglu Liu , Cunchen Hu , Liangliang Xu , Fengming Tang , Ruijia Wang , Fu Yu

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than f32 floating point precision. Quantization reduces model size and inference…

Machine Learning · Computer Science 2023-10-02 Eliska Kloberdanz , Wei Le

FrameQuant: Flexible Low-Bit Quantization for Transformers

Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end…

Machine Learning · Computer Science 2024-08-01 Harshavardhan Adepu , Zhanpeng Zeng , Li Zhang , Vikas Singh

MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging due to complex model quantization and conversion. Existing model…

Machine Learning · Computer Science 2023-08-07 Manasa Manohara , Sankalp Dayal , Tariq Afzal , Rahul Bakshi , Kahkuen Fu

Low-bit Quantization of Neural Networks for Efficient Inference

Recent machine learning methods use increasingly large deep neural networks to achieve state of the art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements.…

Machine Learning · Computer Science 2019-03-26 Yoni Choukroun , Eli Kravchik , Fan Yang , Pavel Kisilev

Retraining-Based Iterative Weight Quantization for Deep Neural Networks

Model compression has gained a lot of attention due to its ability to reduce hardware resource requirements significantly while maintaining accuracy of DNNs. Model compression is especially useful for memory-intensive recurrent neural…

Machine Learning · Computer Science 2018-05-30 Dongsoo Lee , Byeongwook Kim

Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling

Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Jinhee Kim , Jae Jun An , Kang Eun Jeon , Jong Hwan Ko

When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and…

Machine Learning · Computer Science 2025-12-23 Michael S. Zhang , Rishi A. Ruia , Arnav Kewalram , Saathvik Dharmapuram , Utkarsh Sharma , Kevin Zhu

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Quantization emerges as one of the most promising compression technologies for deploying efficient large models for various real time application in recent years. Considering that the storage and IO of weights take up the vast majority of…

Machine Learning · Computer Science 2024-04-22 Yi Guo , Fanliu Kong , Xiaoyang Li , Hui Li , Wei Chen , Xiaogang Tian , Jinping Cai , Yang Zhang , Shouda Liu

FracBits: Mixed Precision Quantization via Fractional Bit-Widths

Model quantization helps to reduce model size and latency of deep neural networks. Mixed precision quantization is favorable with customized hardwares supporting arithmetic operations at multiple bit-widths to achieve maximum efficiency. We…

Computer Vision and Pattern Recognition · Computer Science 2020-12-04 Linjie Yang , Qing Jin

Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview

This paper provides a comprehensive overview of the principles, challenges, and methodologies associated with quantizing large-scale neural network models. As neural networks have evolved towards larger and more complex architectures to…

Machine Learning · Computer Science 2024-09-19 Yanshu Wang , Tong Yang , Xiyan Liang , Guoan Wang , Hanning Lu , Xu Zhe , Yaoming Li , Li Weitao

A White Paper on Neural Network Quantization

While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge…

Machine Learning · Computer Science 2021-06-16 Markus Nagel , Marios Fournarakis , Rana Ali Amjad , Yelysei Bondarenko , Mart van Baalen , Tijmen Blankevoort

Task Vector Quantization for Memory-Efficient Model Merging

Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to…

Machine Learning · Computer Science 2025-08-08 Youngeun Kim , Seunghwan Lee , Aecheon Jung , Bogon Ryu , Sungeun Hong

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal , Hans Jakob Damsgaard , Alessandro Pappalardo , Giuseppe Franco , Thomas B. Preußer , Michaela Blott , Tulika Mitra

An Empirical Study of Low Precision Quantization for TinyML

Tiny machine learning (tinyML) has emerged during the past few years aiming to deploy machine learning models to embedded AI processors with highly constrained memory and computation capacity. Low precision quantization is an important…

Machine Learning · Computer Science 2022-03-11 Shaojie Zhuo , Hongyu Chen , Ramchalam Kinattinkara Ramakrishnan , Tommy Chen , Chen Feng , Yicheng Lin , Parker Zhang , Liang Shen

Training with Quantization Noise for Extreme Model Compression

We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the…

Machine Learning · Computer Science 2021-03-02 Angela Fan , Pierre Stock , Benjamin Graham , Edouard Grave , Remi Gribonval , Herve Jegou , Armand Joulin

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization…

Machine Learning · Computer Science 2020-04-22 Hao Wu , Patrick Judd , Xiaojie Zhang , Mikhail Isaev , Paulius Micikevicius