Related papers: Light Multi-segment Activation for Model Compressi…

Activation Map Adaptation for Effective Knowledge Distillation

Model compression becomes a recent trend due to the requirement of deploying neural networks on embedded and mobile devices. Hence, both accuracy and efficiency are of critical importance. To explore a balance between them, a knowledge…

Computer Vision and Pattern Recognition · Computer Science 2022-04-15 Zhiyuan Wu, Hong Qi, Yu Jiang, Minghao Zhao, Chupeng Cui, Zongmin Yang, Xinhui Xue

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System

Deep pre-training and fine-tuning models (like BERT, OpenAI GPT) have demonstrated excellent results in question answering areas. However, due to the sheer amount of model parameters, the inference speed of these models is very slow. How to…

Computation and Language · Computer Science 2019-04-23 Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, Daxin Jiang

LAKD-Activation Mapping Distillation Based on Local Learning

Knowledge distillation is widely applied in various fundamental vision models to enhance the performance of compact models. Existing knowledge distillation methods focus on designing different distillation targets to acquire knowledge from…

Computer Vision and Pattern Recognition · Computer Science 2024-08-23 Yaoze Zhang, Yuming Zhang, Yu Zhao, Yue Zhang, Feiyu Zhu

Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression

Knowledge distillation (KD) is an effective model compression technique where a compact student network is taught to mimic the behavior of a complex and highly trained teacher network. In contrast, Mutual Learning (ML) provides an…

Computer Vision and Pattern Recognition · Computer Science 2021-10-25 Usma Niyaz, Deepti R. Bathula

Multi-head Knowledge Distillation for Model Compression

Several methods of knowledge distillation have been developed for neural network compression. While they all use the KL divergence loss to align the soft outputs of the student model more closely with that of the teacher, the various…

Computer Vision and Pattern Recognition · Computer Science 2020-12-08 Huan Wang, Suhas Lohit, Michael Jones, Yun Fu

Model Compression Using Optimal Transport

Model compression methods are important to allow for easier deployment of deep learning models in compute, memory and energy-constrained environments such as mobile phones. Knowledge distillation is a class of model compression algorithm…

Computer Vision and Pattern Recognition · Computer Science 2020-12-08 Suhas Lohit, Michael Jones

An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation

Compressing deep neural network (DNN) models becomes a very important and necessary technique for real-world applications, such as deploying those models on mobile devices. Knowledge distillation is one of the most popular methods for model…

Machine Learning · Computer Science 2020-03-02 Makoto Takamoto, Yusuke Morishita, Hitoshi Imaoka

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

Pre-trained language models (PLMs) achieve great success in NLP. However, their huge model sizes hinder their applications in many practical systems. Knowledge distillation is a popular technique to compress PLMs, which learns a small…

Computation and Language · Computer Science 2021-06-03 Chuhan Wu, Fangzhao Wu, Yongfeng Huang

Reinforced Multi-Teacher Selection for Knowledge Distillation

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation…

Computation and Language · Computer Science 2020-12-15 Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks, however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the…

Computation and Language · Computer Science 2025-07-11 Joyeeta Datta, Niclas Doll, Qusai Ramadan, Zeyd Boukhers

Patient Knowledge Distillation for BERT Model Compression

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In…

Computation and Language · Computer Science 2019-08-27 Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results in question answering areas. However, due to the sheer amount of model parameters, the inference speed of these models is very slow.…

Computation and Language · Computer Science 2019-10-21 Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, Daxin Jiang

LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora

Large Language Models (LLMs) are highly accurate in classification tasks, however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD) where a LLM "teacher"…

Machine Learning · Computer Science 2025-11-18 Viviana Luccioli, Rithika Iyengar, Ryan Panley, Flora Haberkorn, Xiaoyu Ge, Leland Crane, Nitish Sinha, Seung Jung Lee

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result,…

Computation and Language · Computer Science 2026-05-05 Pham Khanh Chi, Quoc Phong Dao, Thuat Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources to train such models remain an issue. Knowledge distillation alleviates this problem by learning a…

Computation and Language · Computer Science 2020-05-04 Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

Few Sample Knowledge Distillation for Efficient Network Compression

Deep neural network compression techniques such as pruning and weight tensor decomposition usually require fine-tuning to recover the prediction accuracy when the compression ratio is high. However, conventional fine-tuning suffers from the…

Machine Learning · Computer Science 2020-04-01 Tianhong Li, Jianguo Li, Zhuang Liu, Changshui Zhang

Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression

Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter…

Computer Vision and Pattern Recognition · Computer Science 2025-01-17 Yongheng Zhang, Danfeng Yan

Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification

Knowledge distillation is a potential solution for model compression. The idea is to make a small student network imitate the target of a large teacher network, then the student network can be competitive to the teacher one. Most previous…

Computer Vision and Pattern Recognition · Computer Science 2017-10-24 Chong Wang, Xipeng Lan, Yangang Zhang

Online Ensemble Model Compression using Knowledge Distillation

This paper presents a novel knowledge distillation based model compression framework consisting of a student ensemble. It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models. Each…

Computer Vision and Pattern Recognition · Computer Science 2020-11-17 Devesh Walawalkar, Zhiqiang Shen, Marios Savvides