Related papers: Selective Cross-Task Distillation

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen , Jian-Ping Mei , Hailin Zhang , Can Wang , Yan Feng , Chun Chen

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use…

Computation and Language · Computer Science 2023-02-02 Chenglong Wang , Yi Lu , Yongyu Mu , Yimin Hu , Tong Xiao , Jingbo Zhu

Unified and Effective Ensemble Knowledge Distillation

Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model. Many existing methods learn and distill the student model on labeled data only. However, the teacher models are…

Machine Learning · Computer Science 2022-04-04 Chuhan Wu , Fangzhao Wu , Tao Qi , Yongfeng Huang

Can a student Large Language Model perform as well as it's teacher?

The burgeoning complexity of contemporary deep learning models, while achieving unparalleled accuracy, has inadvertently introduced deployment challenges in resource-constrained environments. Knowledge distillation, a technique aiming to…

Machine Learning · Computer Science 2023-10-05 Sia Gholami , Marwan Omar

Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the…

Computation and Language · Computer Science 2021-06-03 Xinyu Wang , Yong Jiang , Zhaohui Yan , Zixia Jia , Nguyen Bach , Tao Wang , Zhongqiang Huang , Fei Huang , Kewei Tu

Multi-Label Knowledge Distillation

Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning.…

Machine Learning · Computer Science 2025-06-02 Penghui Yang , Ming-Kun Xie , Chen-Chen Zong , Lei Feng , Gang Niu , Masashi Sugiyama , Sheng-Jun Huang

Contrastive Representation Distillation

Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a…

Machine Learning · Computer Science 2022-01-26 Yonglong Tian , Dilip Krishnan , Phillip Isola

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Knowledge distillation is a strategy of training a student network with guide of the soft output from a teacher network. It has been a successful method of model compression and knowledge transfer. However, currently knowledge distillation…

Machine Learning · Computer Science 2024-10-21 Guangda Ji , Zhanxing Zhu

Reinforced Multi-Teacher Selection for Knowledge Distillation

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation…

Computation and Language · Computer Science 2020-12-15 Fei Yuan , Linjun Shou , Jian Pei , Wutao Lin , Ming Gong , Yan Fu , Daxin Jiang

Prototype-guided Cross-task Knowledge Distillation for Large-scale Models

Recently, large-scale pre-trained models have shown their advantages in many tasks. However, due to the huge computational complexity and storage requirements, it is challenging to apply the large-scale model to real scenes. A common…

Computer Vision and Pattern Recognition · Computer Science 2022-12-27 Deng Li , Aming Wu , Yahong Han , Qi Tian

Knowledge distillation is a widely applicable technique for training a student neural network under the guidance of a trained teacher network. For example, in neural network compression, a high-capacity teacher is distilled to train a…

Computer Vision and Pattern Recognition · Computer Science 2019-08-05 Frederick Tung , Greg Mori

Cross-Modal Distillation For Widely Differing Modalities

Deep learning achieved great progress recently, however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more…

Artificial Intelligence · Computer Science 2025-10-07 Cairong Zhao , Yufeng Jin , Zifan Song , Haonan Chen , Duoqian Miao , Guosheng Hu

Highlight Every Step: Knowledge Distillation via Collaborative Teaching

High storage and computational costs obstruct deep neural networks to be deployed on resource-constrained devices. Knowledge distillation aims to train a compact student network by transferring knowledge from a larger pre-trained teacher…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Haoran Zhao , Xin Sun , Junyu Dong , Changrui Chen , Zihe Dong

Distilling Knowledge via Intermediate Classifiers

The crux of knowledge distillation is to effectively train a resource-limited student model with the guide of a pre-trained larger teacher model. However, when there is a large difference between the model complexities of teacher and…

Machine Learning · Computer Science 2021-06-01 Aryan Asadian , Amirali Salehi-Abari

Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Training a small student network with the guidance of a larger teacher network is an effective way to promote the performance of the student. Despite the different types, the guided knowledge used to distill is always kept unchanged for…

Computer Vision and Pattern Recognition · Computer Science 2021-04-01 Jiangfan Han , Mengya Gao , Yujie Wang , Quanquan Li , Hongsheng Li , Xiaogang Wang

Confidence-Aware Multi-Teacher Knowledge Distillation

Knowledge distillation is initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources…

Machine Learning · Computer Science 2022-02-15 Hailin Zhang , Defang Chen , Can Wang

Distilling Knowledge via Knowledge Review

Knowledge distillation transfers knowledge from the teacher network to the student one, with the goal of greatly improving the performance of the student network. Previous methods mostly focus on proposing feature transformation and loss…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Pengguang Chen , Shu Liu , Hengshuang Zhao , Jiaya Jia

Revisiting Knowledge Distillation for Object Detection

The existing solutions for object detection distillation rely on the availability of both a teacher model and ground-truth labels. We propose a new perspective to relax this constraint. In our framework, a student is first trained with…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Amin Banitalebi-Dehkordi

Student Network Learning via Evolutionary Knowledge Distillation

Knowledge distillation provides an effective way to transfer knowledge via teacher-student learning, where most existing distillation approaches apply a fixed pre-trained model as teacher to supervise the learning of student network. This…

Machine Learning · Computer Science 2021-03-26 Kangkai Zhang , Chunhui Zhang , Shikun Li , Dan Zeng , Shiming Ge

Cross-Task Knowledge Distillation in Multi-Task Recommendation

Multi-task learning (MTL) has been widely used in recommender systems, wherein predicting each type of user feedback on items (e.g, click, purchase) are treated as individual tasks and jointly trained with a unified model. Our key…

Information Retrieval · Computer Science 2022-03-29 Chenxiao Yang , Junwei Pan , Xiaofeng Gao , Tingyu Jiang , Dapeng Liu , Guihai Chen