Related papers: Learning Interpretation with Explainable Knowledge…

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models. While it is well known that student models can achieve similar accuracies as the teachers, it has also been shown that they…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Amin Parchami-Araghi, Moritz Böhle, Sukrut Rao, Bernt Schiele

Improving Knowledge Distillation with Teacher's Explanation

Knowledge distillation (KD) improves the performance of a low-complexity student model with the help of a more powerful teacher. The teacher in KD is a black-box model, imparting knowledge to the student only through its predictions. This…

Machine Learning · Computer Science 2023-10-05 Sayantan Chowdhury, Ben Liang, Ali Tizghadam, Ilijc Albanese

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

On the Impact of Knowledge Distillation for Model Interpretability

Several recent studies have elucidated why knowledge distillation (KD) improves model performance. However, few have researched the other advantages of KD in addition to its improving model performance. In this study, we have attempted to…

Machine Learning · Computer Science 2023-05-26 Hyeongrok Han, Siwon Kim, Hyun-Soo Choi, Sungroh Yoon

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Knowledge distillation (KD) is a new method for transferring knowledge of a structure under training to another one. The typical application of KD is in the form of learning a small model (named as a student) by soft labels produced by a…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Sajjad Abbasi, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

Improved knowledge distillation by utilizing backward pass knowledge in neural networks

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science 2023-01-31 Aref Jafari, Mehdi Rezagholizadeh, Ali Ghodsi

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science 2021-03-02 Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, Sagar Jain

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang, Lei Li, Xu Sun

Preparing Lessons: Improve Knowledge Distillation with Better Supervision

Knowledge distillation (KD) is widely used for training a compact model with the supervision of another large model, which could effectively improve the performance. Previous methods mainly focus on two aspects: 1) training the student to…

Computer Vision and Pattern Recognition · Computer Science 2020-07-27 Tiancheng Wen, Shenqi Lai, Xueming Qian

An Empirical Study of Knowledge Distillation for Code Understanding Tasks

Pre-trained language models (PLMs) have emerged as powerful tools for code understanding. However, deploying these PLMs in large-scale applications faces practical challenges due to their computational intensity and inference latency.…

Software Engineering · Computer Science 2025-08-22 Ruiqi Wang, Zezhou Yang, Cuiyun Gao, Xin Xia, Qing Liao

Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods

Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data…

Machine Learning · Computer Science 2025-10-31 Jiali Cheng, Chirag Agarwal, Hadi Amiri

Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected…

Computation and Language · Computer Science 2021-09-24 Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

Can Students Beyond The Teacher? Distilling Knowledge from Teacher's Bias

Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance its performance. Existing methods often assume that the student model is inherently…

Computer Vision and Pattern Recognition · Computer Science 2024-12-16 Jianhua Zhang, Yi Gao, Ruyu Liu, Xu Cheng, Houxiang Zhang, Shengyong Chen

Heterogeneous Knowledge Distillation using Information Flow Modeling

Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last…

Computer Vision and Pattern Recognition · Computer Science 2020-05-05 Nikolaos Passalis, Maria Tzelepi, Anastasios Tefas

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge hides in KD is still not clear, which may hinder the development of KD. In this work, we first unravel…

Computation and Language · Computer Science 2024-07-18 Songming Zhang, Yunlong Liang, Shuaibo Wang, Wenjuan Han, Jian Liu, Jinan Xu, Yufeng Chen

The Role of Teacher Calibration in Knowledge Distillation

Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success,…

Machine Learning · Computer Science 2025-08-29 Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak

Knowledge Distillation Layer that Lets the Student Decide

Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Ada Gorgun, Yeti Z. Gurbuz, A. Aydin Alatan

Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication

Many recent breakthroughs in machine learning have been enabled by the pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the…

Artificial Intelligence · Computer Science 2023-10-06 Zhe Zhao, Qingyun Liu, Huan Gui, Bang An, Lichan Hong, Ed H. Chi

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more…

Computer Vision and Pattern Recognition · Computer Science 2021-06-18 Lin Wang, Kuk-Jin Yoon