Related papers: Pruning Ternary Quantization
Inference time, model size, and accuracy are critical for deploying deep neural network models. Numerous research efforts have been made to compress neural network models with faster inference and higher accuracy. Pruning and quantization…
Model quantization enables the deployment of deep neural networks under resource-constrained devices. Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings, i.e., codewords, while the…
Deep neural networks are widely used in machine learning applications. However, the deployment of large neural networks models can be difficult to deploy on mobile devices with limited power budgets. To solve this problem, we propose…
Model compression is instrumental in optimizing deep neural network inference on resource-constrained hardware. The prevailing methods for network compression, namely quantization and pruning, have been shown to enhance efficiency at the…
Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can help reduce the size and computational cost of deep…
Model compression has gained a lot of attention due to its ability to reduce hardware resource requirements significantly while maintaining accuracy of DNNs. Model compression is especially useful for memory-intensive recurrent neural…
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full precision models, while also constraining activations to 8 and 4-bits. Using this method, we demonstrate a minimal loss in classification accuracy on…
Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant…
Neural network models are resource hungry. It is difficult to deploy such deep networks on devices with limited resources, like smart wearables, cellphones, drones, and autonomous vehicles. Low bit quantization such as binary and ternary…
Deep Neural Networks (DNNs) have achieved significant advances in a wide range of applications. However, their deployment on resource-constrained devices remains a challenge due to the large number of layers and parameters, which result in…
Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high…
As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network…
In the traditional deep compression framework, iteratively performing network pruning and quantization can reduce the model size and computation cost to meet the deployment requirements. However, such a step-wise application of pruning and…
Neural network quantization and pruning are two techniques commonly used to reduce the computational complexity and memory footprint of these models for deployment. However, most existing pruning strategies operate on full-precision and…
Quantization of weights of deep neural networks (DNN) has proven to be an effective solution for the purpose of implementing DNNs on edge devices such as mobiles, ASICs and FPGAs, because they have no sufficient resources to support…
Model architectures have been dramatically increasing in size, improving performance at the cost of resource requirements. In this paper we propose 3DQ, a ternary quantization method, applied for the first time to 3D Fully Convolutional…
This paper presents a novel network compression framework Kernel Quantization (KQ), targeting to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant…
Quantization and pruning are two effective Deep Neural Networks model compression methods. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique combining quantization with pruning. APB enhances the…
Deep Neural Networks(DNNs) have many parameters and activation data, and these both are expensive to implement. One method to reduce the size of the DNN is to quantize the pre-trained model by using a low-bit expression for weights and…
Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model…