Related papers: Switchable Online Knowledge Distillation

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the…

Computation and Language · Computer Science 2025-04-29 Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

Knowledge distillation has become widely recognized for its ability to transfer knowledge from a large teacher network to a compact and more streamlined student network. Traditional knowledge distillation methods primarily follow a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-30 Chaomin Shen, Yaomin Huang, Haokun Zhu, Jinsong Fan, Guixu Zhang

Online Knowledge Distillation with Diverse Peers

Distillation is an effective knowledge-transfer technique that uses predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high capacity teacher, however, is not always…

Machine Learning · Computer Science 2019-12-06 Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, Chun Chen

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and…

Computer Vision and Pattern Recognition · Computer Science 2019-09-10 Mengya Gao, Yujun Shen, Quanquan Li, Junjie Yan, Liang Wan, Dahua Lin, Chen Change Loy, Xiaoou Tang

Improving Knowledge Distillation via Transferring Learning Ability

Existing knowledge distillation methods generally use a teacher-student approach, where the student network solely learns from a well-trained teacher. However, this approach overlooks the inherent differences in learning abilities between…

Computer Vision and Pattern Recognition · Computer Science 2023-09-19 Long Liu, Tong Li, Hui Cheng

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. The success of KD in auto-regressive language models mainly relies on Reverse KL for mode-seeking and…

Computation and Language · Computer Science 2024-09-23 Jun Rao, Xuebo Liu, Zepeng Lin, Liang Ding, Jing Li, Dacheng Tao, Min Zhang

Spot-adaptive Knowledge Distillation

Knowledge distillation (KD) has become a well established paradigm for compressing deep neural networks. The typical way of conducting knowledge distillation is to train the student network under the supervision of the teacher network to…

Computer Vision and Pattern Recognition · Computer Science 2022-05-06 Jie Song, Ying Chen, Jingwen Ye, Mingli Song

Subclass Knowledge Distillation with Known Subclass Labels

This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary…

Machine Learning · Computer Science 2022-07-19 Ahmad Sajedi, Yuri A. Lawryshyn, Konstantinos N. Plataniotis

MoKD: Multi-Task Optimization for Knowledge Distillation

Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in Knowledge Distillation (KD) are: 1) balancing learning…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi

Semi-Online Knowledge Distillation

Knowledge distillation is an effective and stable method for model compression via knowledge transfer. Conventional knowledge distillation (KD) is to transfer knowledge from a large and well pre-trained teacher network to a small student…

Computer Vision and Pattern Recognition · Computer Science 2021-11-24 Zhiqiang Liu, Yanxia Liu, Chengkai Huang

Adaptive Teaching with Shared Classifier for Knowledge Distillation

Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Jaeyeon Jang, Young-Ik Kim, Jisu Lim, Hyeonseong Lee

Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one. Most existing approaches enhance the student model by utilizing the similarity…

Computer Vision and Pattern Recognition · Computer Science 2021-03-19 Haoran Zhao, Kun Gong, Xin Sun, Junyu Dong, Hui Yu

Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better

Knowledge distillation field delicately designs various types of knowledge to shrink the performance gap between compact student and large-scale teacher. These existing distillation approaches simply focus on the improvement of…

Computer Vision and Pattern Recognition · Computer Science 2021-09-28 Xuanyang Zhang, Xiangyu Zhang, Jian Sun

On the Efficiency of Subclass Knowledge Distillation in Classification Tasks

This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary…

Machine Learning · Computer Science 2022-07-06 Ahmad Sajedi, Konstantinos N. Plataniotis

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang, Lei Li, Xu Sun

Direct Distillation between Different Domains

Knowledge Distillation (KD) aims to learn a compact student network using knowledge from a large pre-trained teacher network, where both networks are trained on data from the same distribution. However, in practical applications, the…

Machine Learning · Computer Science 2024-01-17 Jialiang Tang, Shuo Chen, Gang Niu, Hongyuan Zhu, Joey Tianyi Zhou, Chen Gong, Masashi Sugiyama

ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

Knowledge Distillation (KD) transfers knowledge from large models to small models and has recently achieved remarkable success. However, the reliability of existing KD methods in real-world applications, especially under distribution shift,…

Machine Learning · Computer Science 2025-07-22 Songming Zhang, Yuxiao Luo, Ziyu Lyu, Xiaofeng Chen

Reducing the Teacher-Student Gap via Spherical Knowledge Distillation

Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one. Due to the limited capacity of the student, the student would underfit the teacher. Therefore, student…

Machine Learning · Computer Science 2021-01-13 Jia Guo, Minghao Chen, Yao Hu, Chen Zhu, Xiaofei He, Deng Cai

Swapped Logit Distillation via Bi-level Teacher Alignment

Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student). It has been mainstream that the teacher directly transfers knowledge to the student with its…

Machine Learning · Computer Science 2025-05-26 Stephen Ekaputra Limantoro, Jhe-Hao Lin, Chih-Yu Wang, Yi-Lung Tsai, Hong-Han Shuai, Ching-Chun Huang, Wen-Huang Cheng

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science 2021-03-02 Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, Sagar Jain