MRQ: Support Multiple Quantization Schemes through Model Re-Quantization

Machine Learning 2023-08-07 v2 Computer Vision and Pattern Recognition

Abstract

Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware remains challenging because model quantization and conversion are complex. Existing model quantization frameworks such as TensorFlow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). As a result, deep learning models cannot be easily quantized for diverse fixed-point hardware, mainly because each target has slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms them to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and supports multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms, including weight correction and rounding error folding. We demonstrate that a MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric + power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage the concept of re-quantization for model quantization. Models obtained from the re-quantization process have been successfully deployed on the NNA in Echo Show devices.
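To make the two transformations named in the abstract concrete (asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale), the following is a minimal sketch of a naive per-tensor re-quantization. It is not the paper's method: the weight correction and rounding error folding algorithms are not described in this abstract and are not reproduced here. The function names, the 8-bit range, the choice to round the scale up to the next power of 2, and the sample values are all assumptions made for illustration.

```python
import math


def dequantize(q, scale, zero_point):
    """Map asymmetric integer codes back to real values: r = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


def requantize_symmetric(q, scale, zero_point, num_bits=8, power_of_two=False):
    """Naively re-quantize an asymmetric per-tensor quantized tensor.

    Illustrative only: dequantize, pick a symmetric scale (zero point fixed at 0),
    optionally round that scale up to a power of 2, then round to the nearest code.
    """
    real = dequantize(q, scale, zero_point)
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, symmetric range [-127, 127]
    max_abs = max(abs(r) for r in real)
    new_scale = max_abs / qmax if max_abs > 0 else 1.0
    if power_of_two:
        # Rounding the scale up (assumed choice) keeps every value inside the range.
        new_scale = 2.0 ** math.ceil(math.log2(new_scale))
    new_q = [max(-qmax, min(qmax, round(r / new_scale))) for r in real]
    return new_q, new_scale


# Hypothetical asymmetric uint8 tensor with scale 0.02 and zero point 128.
q_asym = [0, 64, 128, 192, 255]
scale, zp = 0.02, 128

q_sym, s_sym = requantize_symmetric(q_asym, scale, zp)
q_p2, s_p2 = requantize_symmetric(q_asym, scale, zp, power_of_two=True)

print("symmetric:", q_sym, s_sym)
print("symmetric + power-of-2 scale:", q_p2, s_p2)
```

In this sketch the power-of-2 variant uses a coarser scale, so its rounding error is larger than in the plain symmetric case. That extra error is the kind of re-quantization error the abstract says the proposed algorithms are designed to minimize.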

Cite

@article{arxiv.2308.01867,
  title  = {MRQ:Support Multiple Quantization Schemes through Model Re-Quantization},
  author = {Manasa Manohara and Sankalp Dayal and Tariq Afzal and Rahul Bakshi and Kahkuen Fu},
  journal= {arXiv preprint arXiv:2308.01867},
  year   = {2023}
}

Comments

8 pages, 6 figures, 3 tables, TinyML Conference