Related papers: Teacher-Guided Student Self-Knowledge Distillation…

Knowledge Diffusion for Distillation

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature…

Computer Vision and Pattern Recognition · Computer Science 2023-12-05 Tao Huang, Yuan Zhang, Mingkai Zheng, Shan You, Fei Wang, Chen Qian, Chang Xu

Knowledge Distillation with Deep Supervision

Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge from a pre-trained cumbersome teacher model. However, in the traditional knowledge distillation, teacher predictions are only…

Machine Learning · Computer Science 2023-05-26 Shiya Luo, Defang Chen, Can Wang

Improving Knowledge Distillation via Regularizing Feature Norm and Direction

Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task. Treating teacher features as knowledge, prevailing methods of knowledge distillation train…

Computer Vision and Pattern Recognition · Computer Science 2023-05-29 Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Knowledge distillation (KD) is a new method for transferring knowledge of a structure under training to another one. The typical application of KD is in the form of learning a small model (named as a student) by soft labels produced by a…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Sajjad Abbasi, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi

Extracting knowledge from features with multilevel abstraction

Knowledge distillation aims at transferring the knowledge from a large teacher model to a small student model with great improvements of the performance of the student model. Therefore, the student network can replace the teacher network to…

Machine Learning · Computer Science 2021-12-28 Jinhong Lin, Zhaoyang Li

Distilling Knowledge by Mimicking Features

Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student…

Computer Vision and Pattern Recognition · Computer Science 2021-08-17 Guo-Hua Wang, Yifan Ge, Jianxin Wu

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and…

Computer Vision and Pattern Recognition · Computer Science 2019-09-10 Mengya Gao, Yujun Shen, Quanquan Li, Junjie Yan, Liang Wan, Dahua Lin, Chen Change Loy, Xiaoou Tang

Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution

Knowledge distillation (KD) is an effective framework that aims to transfer meaningful information from a large teacher to a smaller student. Generally, KD often involves how to define and transfer knowledge. Previous KD methods often focus…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models, where a pre-trained teacher model is used to facilitate the training of the target student model. However, the availability of a…

Computer Vision and Pattern Recognition · Computer Science 2023-05-17 Xucong Wang, Pengchao Han, Lei Guo

Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation

Knowledge distillation is a method of transferring the knowledge from a pretrained complex teacher model to a student model, so a smaller network can replace a large teacher network at the deployment stage. To reduce the necessity of…

Computer Vision and Pattern Recognition · Computer Science 2021-03-16 Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, Il-Chul Moon

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD, are adversely impacted by the…

Computation and Language · Computer Science 2025-04-29 Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Knowledge Distillation Layer that Lets the Student Decide

Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Ada Gorgun, Yeti Z. Gurbuz, A. Aydin Alatan

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

Discriminative and Consistent Representation Distillation

Knowledge Distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. While contrastive learning has shown promise in self-supervised learning by creating discriminative representations, its…

Computer Vision and Pattern Recognition · Computer Science 2025-05-14 Nikolaos Giakoumoglou, Tania Stathaki

Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation

Knowledge distillation (KD) has shown very promising capabilities in transferring learning representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers becomes larger,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Zengyu Qiu, Xinzhu Ma, Kunlin Yang, Chunya Liu, Jun Hou, Shuai Yi, Wanli Ouyang

LAKD-Activation Mapping Distillation Based on Local Learning

Knowledge distillation is widely applied in various fundamental vision models to enhance the performance of compact models. Existing knowledge distillation methods focus on designing different distillation targets to acquire knowledge from…

Computer Vision and Pattern Recognition · Computer Science 2024-08-23 Yaoze Zhang, Yuming Zhang, Yu Zhao, Yue Zhang, Feiyu Zhu

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data…

Machine Learning · Computer Science 2025-10-31 Jiali Cheng, Chirag Agarwal, Hadi Amiri

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li

Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Speech denoising is a generally adopted and impactful task, appearing in many common and everyday-life use cases. Although there are very powerful methods published, most of those are too complex for deployment in everyday and low-resources…

Sound · Computer Science 2025-05-07 Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen