Related papers: Matryoshka Quantization

200 papers

Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions by slicing the most significant bits (MSBs) at inference time. This enables a single…

Machine Learning · Computer Science 2026-02-04 Maximilian Kleinegger, Elvir Crnčević, Dan Alistarh

Based on the model's resilience to computational noise, model quantization is important for compressing models and improving computing speed. Existing quantization techniques rely heavily on experience and "fine-tuning" skills. In the…

Machine Learning · Computer Science 2022-07-22 Daning Cheng, Wenguang Chen

This paper presents a comprehensive analysis of quantization techniques for optimizing Large Language Models (LLMs), specifically focusing on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Through empirical…

Machine Learning · Computer Science 2024-11-12 Jahid Hasan

Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across…

Machine Learning · Computer Science 2026-04-10 Minglu Liu, Cunchen Hu, Liangliang Xu, Fengming Tang, Ruijia Wang, Fu Yu

Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than f32 floating point precision. Quantization reduces model size and inference…

Machine Learning · Computer Science 2023-10-02 Eliska Kloberdanz, Wei Le

Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, so serving such models is expensive, often requiring high-end…

Machine Learning · Computer Science 2024-08-01 Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh

Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging due to complex model quantization and conversion. Existing model…

Machine Learning · Computer Science 2023-08-07 Manasa Manohara, Sankalp Dayal, Tariq Afzal, Rahul Bakshi, Kahkuen Fu

Recent machine learning methods use increasingly large deep neural networks to achieve state of the art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements.…

Machine Learning · Computer Science 2019-03-26 Yoni Choukroun, Eli Kravchik, Fan Yang, Pavel Kisilev

Model compression has gained a lot of attention due to its ability to reduce hardware resource requirements significantly while maintaining accuracy of DNNs. Model compression is especially useful for memory-intensive recurrent neural…

Machine Learning · Computer Science 2018-05-30 Dongsoo Lee, Byeongwook Kim

Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko

Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and…

Machine Learning · Computer Science 2025-12-23 Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu

Quantization has emerged in recent years as one of the most promising compression technologies for deploying efficient large models in various real-time applications. Considering that the storage and IO of weights take up the vast majority of…

Machine Learning · Computer Science 2024-04-22 Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu

Model quantization helps to reduce the model size and latency of deep neural networks. Mixed-precision quantization is favorable with customized hardware that supports arithmetic operations at multiple bit-widths to achieve maximum efficiency. We…

Computer Vision and Pattern Recognition · Computer Science 2020-12-04 Linjie Yang, Qing Jin

This paper provides a comprehensive overview of the principles, challenges, and methodologies associated with quantizing large-scale neural network models. As neural networks have evolved towards larger and more complex architectures to…

Machine Learning · Computer Science 2024-09-19 Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao

While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge…

Machine Learning · Computer Science 2021-06-16 Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort

Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to…

Machine Learning · Computer Science 2025-08-08 Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong

Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Tiny machine learning (tinyML) has emerged during the past few years aiming to deploy machine learning models to embedded AI processors with highly constrained memory and computation capacity. Low precision quantization is an important…

We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the…

Machine Learning · Computer Science 2021-03-02 Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, Armand Joulin

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization…

Machine Learning · Computer Science 2020-04-22 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius