Related papers: Multi-head Knowledge Distillation for Model Compre…

Improved knowledge distillation by utilizing backward pass knowledge in neural networks

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science 2023-01-31 Aref Jafari, Mehdi Rezagholizadeh, Ali Ghodsi

Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression

Knowledge distillation (KD) is an effective model compression technique where a compact student network is taught to mimic the behavior of a complex and highly trained teacher network. In contrast, Mutual Learning (ML) provides an…

Computer Vision and Pattern Recognition · Computer Science 2021-10-25 Usma Niyaz, Deepti R. Bathula

Multi-level Knowledge Distillation via Knowledge Alignment and Correlation

Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We demonstrate that…

Computer Vision and Pattern Recognition · Computer Science 2021-06-07 Fei Ding, Yin Yang, Hongxin Hu, Venkat Krovi, Feng Luo

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model BERT, they could be further improved in two aspects: the…

Computation and Language · Computer Science 2024-07-04 Ying Zhang, Ziheng Yang, Shufan Ji

Distilling Knowledge by Mimicking Features

Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student…

Computer Vision and Pattern Recognition · Computer Science 2021-08-17 Guo-Hua Wang, Yifan Ge, Jianxin Wu

Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning

Multi-Teacher knowledge distillation provides students with additional supervision from multiple pre-trained teachers with diverse information sources. Most existing methods explore different weighting strategies to obtain a powerful…

Computer Vision and Pattern Recognition · Computer Science 2023-06-13 Hailin Zhang, Defang Chen, Can Wang

UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations

Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing computational and storage costs while maintaining competitive accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor…

Computation and Language · Computer Science 2020-12-29 Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the…

Computation and Language · Computer Science 2025-08-01 Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

Knowledge Distillation (KD) has emerged as a pivotal technique for neural network compression and performance enhancement. Most KD methods aim to transfer dark knowledge from a cumbersome teacher model to a lightweight student model based…

Machine Learning · Computer Science 2024-10-10 Wenqi Niu, Yingchao Wang, Guohui Cai, Hanpo Hou

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and…

Computer Vision and Pattern Recognition · Computer Science 2019-09-10 Mengya Gao, Yujun Shen, Quanquan Li, Junjie Yan, Liang Wan, Dahua Lin, Chen Change Loy, Xiaoou Tang

HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large…

Machine Learning · Computer Science 2025-12-11 Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the…

Computation and Language · Computer Science 2026-04-06 Zihe Liu, Yulong Mao, Jinan Xu, Xinrui Peng, Kaiyu Huang

CrossKD: Cross-Head Knowledge Distillation for Object Detection

Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou

Cross-Layer Distillation with Semantic Calibration

Knowledge distillation is a technique to enhance the generalization ability of a student model by exploiting outputs from a teacher model. Recently, feature-map based variants explore knowledge transfer between manually assigned…

Computer Vision and Pattern Recognition · Computer Science 2021-08-31 Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Yan Feng, Chun Chen

Residual Knowledge Distillation

Knowledge distillation (KD) is one of the most potent ways for model compression. The key idea is to transfer the knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance…

Machine Learning · Computer Science 2020-02-24 Mengya Gao, Yujun Shen, Quanquan Li, Chen Change Loy

Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients

Efficient deployment of deep neural networks on resource-constrained devices demands advanced compression techniques that preserve accuracy and interoperability. This paper proposes a machine learning framework that augments Knowledge…

Machine Learning · Computer Science 2025-03-18 David E. Hernandez, Jose Ramon Chang, Torbjörn E. M. Nordling

Harmonizing knowledge Transfer in Neural Network with Unified Distillation

Knowledge distillation (KD), known for its ability to transfer knowledge from a cumbersome network (teacher) to a lightweight one (student) without altering the architecture, has been garnering increasing attention. Two primary categories…

Computer Vision and Pattern Recognition · Computer Science 2024-09-30 Yaomin Huang, Zaomin Yan, Chaomin Shen, Faming Fang, Guixu Zhang