Related papers: KDExplainer: A Task-oriented Attention Model for E…

A Comprehensive Survey on Knowledge Distillation

Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Amir M. Mansourian , Rozhan Ahmadi , Masoud Ghafouri , Amir Mohammad Babaei , Elaheh Badali Golezani , Zeynab Yasamani Ghamchi , Vida Ramezanian , Alireza Taherian , Kimia Dinashi , Amirali Miri , Shohreh Kasaei

Learning Interpretation with Explainable Knowledge Distillation

Knowledge Distillation (KD) has been considered as a key solution in model compression and acceleration in recent years. In KD, a small student model is generally trained from a large teacher model by minimizing the divergence between the…

Machine Learning · Computer Science 2021-11-16 Raed Alharbi , Minh N. Vu , My T. Thai

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Knowledge distillation (KD) is a new method for transferring knowledge of a structure under training to another one. The typical application of KD is in the form of learning a small model (named as a student) by soft labels produced by a…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Sajjad Abbasi , Mohsen Hajabdollahi , Nader Karimi , Shadrokh Samavi

A Survey on Recent Teacher-student Learning Studies

Knowledge distillation is a method of transferring the knowledge from a complex deep neural network (DNN) to a smaller and faster DNN, while preserving its accuracy. Recent variants of knowledge distillation include teaching assistant…

Machine Learning · Computer Science 2023-04-11 Minghong Gao

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more…

Computer Vision and Pattern Recognition · Computer Science 2021-06-18 Lin Wang , Kuk-Jin Yoon

An Empirical Study of Knowledge Distillation for Code Understanding Tasks

Pre-trained language models (PLMs) have emerged as powerful tools for code understanding. However, deploying these PLMs in large-scale applications faces practical challenges due to their computational intensity and inference latency.…

Software Engineering · Computer Science 2025-08-22 Ruiqi Wang , Zezhou Yang , Cuiyun Gao , Xin Xia , Qing Liao

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang , Junhao Song , Xudong Han , Ziqian Bi , Tianyang Wang , Chia Xin Liang , Xinyuan Song , Yichao Zhang , Qian Niu , Benji Peng , Keyu Chen , Ming Liu

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge hides in KD is still not clear, which may hinder the development of KD. In this work, we first unravel…

Computation and Language · Computer Science 2024-07-18 Songming Zhang , Yunlong Liang , Shuaibo Wang , Wenjuan Han , Jian Liu , Jinan Xu , Yufeng Chen

Knowledge Distillation Performs Partial Variance Reduction

Knowledge distillation is a popular approach for enhancing the performance of ''student'' models, with lower representational capacity, by taking advantage of more powerful ''teacher'' models. Despite its apparent simplicity and widespread…

Machine Learning · Computer Science 2023-12-12 Mher Safaryan , Alexandra Peste , Dan Alistarh

Improving Knowledge Distillation with Teacher's Explanation

Knowledge distillation (KD) improves the performance of a low-complexity student model with the help of a more powerful teacher. The teacher in KD is a black-box model, imparting knowledge to the student only through its predictions. This…

Machine Learning · Computer Science 2023-10-05 Sayantan Chowdhury , Ben Liang , Ali Tizghadam , Ilijc Albanese

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz , Elahe Arani , Bahram Zonooz

Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation

The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student…

Computation and Language · Computer Science 2024-03-26 Heegon Jin , Seonil Son , Jemin Park , Youngseok Kim , Hyungjong Noh , Yeonsoo Lee

Knowledge Distillation Layer that Lets the Student Decide

Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Ada Gorgun , Yeti Z. Gurbuz , A. Aydin Alatan

Knowledge Condensation Distillation

Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher network to strengthen a smaller student. Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student. However,…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Chenxin Li , Mingbao Lin , Zhiyuan Ding , Nie Lin , Yihong Zhuang , Yue Huang , Xinghao Ding , Liujuan Cao

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang , Lei Li , Xu Sun

Improved knowledge distillation by utilizing backward pass knowledge in neural networks

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science 2023-01-31 Aref Jafari , Mehdi Rezagholizadeh , Ali Ghodsi

Class Attention Transfer Based Knowledge Distillation

Previous knowledge distillation methods have shown their impressive performance on model compression tasks, however, it is hard to explain how the knowledge they transferred helps to improve the performance of the student network. In this…

Computer Vision and Pattern Recognition · Computer Science 2023-04-26 Ziyao Guo , Haonan Yan , Hui Li , Xiaodong Lin

A Selective Survey on Versatile Knowledge Distillation Paradigm for Neural Network Models

This paper aims to provide a selective survey about knowledge distillation(KD) framework for researchers and practitioners to take advantage of it for developing new optimized models in the deep neural network field. To this end, we give a…

Machine Learning · Computer Science 2020-12-01 Jeong-Hoe Ku , JiHun Oh , YoungYoon Lee , Gaurav Pooniwala , SangJeong Lee

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science 2021-03-02 Jiaxi Tang , Rakesh Shivanna , Zhe Zhao , Dong Lin , Anima Singh , Ed H. Chi , Sagar Jain

Representative Teacher Keys for Knowledge Distillation Model Compression Based on Attention Mechanism for Image Classification

With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the fast development of the Internet of Things (IoT), some robust deep neural networks (DNNs) are usually composed of millions or even hundreds of millions of parameters. Such a…

Computer Vision and Pattern Recognition · Computer Science 2022-10-21 Jun-Teng Yang , Sheng-Che Kao , Scott C. -H. Huang