Related papers: Adaptive Explicit Knowledge Transfer for Knowledge…

Class-aware Information for Logit-based Knowledge Distillation

Knowledge distillation aims to transfer knowledge to the student model by utilizing the predictions/features of the teacher model, and feature-based distillation has recently shown its superiority over logit-based distillation. However, due…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Shuoxi Zhang, Hanpeng Liu, John E. Hopcroft, Kun He

Revisiting Knowledge Distillation for Autoregressive Language Models

Knowledge distillation (KD) is a common approach to compress a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models (LMs), we…

Computation and Language · Computer Science 2024-06-18 Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao

Parameter-Free Logit Distillation via Sorting Mechanism

Knowledge distillation (KD) aims to distill the knowledge from the teacher (larger) to the student (smaller) model via soft-label for the efficient neural network. In general, the performance of a model is determined by accuracy, which is…

Signal Processing · Electrical Eng. & Systems 2025-08-25 Stephen Ekaputra Limantoro

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Knowledge distillation has attracted a great deal of interest recently to compress pre-trained language models. However, existing knowledge distillation methods suffer from two limitations. First, the student model simply imitates the…

Computation and Language · Computer Science 2023-05-18 Siyue Wu, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Rui Wang

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang, Lei Li, Xu Sun

Adaptive Group Robust Ensemble Knowledge Distillation

Neural networks can learn spurious correlations in the data, often leading to performance degradation for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex…

Machine Learning · Computer Science 2025-11-11 Patrik Kenfack, Ulrich Aïvodji, Samira Ebrahimi Kahou

Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on…

Computation and Language · Computer Science 2026-05-12 Xuewen Zhang, Haixiao Zhang, Xinlong Huang

AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting

Knowledge distillation, a widely used model compression technique, works on the basis of transferring knowledge from a cumbersome teacher model to a lightweight student model. The technique involves jointly optimizing the task specific and…

Machine Learning · Computer Science 2024-05-15 Shreyan Ganguly, Roshan Nayak, Rakshith Rao, Ujan Deb, Prathosh AP

Progressive Class-level Distillation

In knowledge distillation (KD), logit distillation (LD) aims to transfer class-level knowledge from a more powerful teacher network to a small student model via accurate teacher-student alignment at the logits level. Since high-confidence…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Jiayan Li, Jun Li, Zhourui Zhang, Jianhua Xu

A Closer Look at Knowledge Distillation with Features, Logits, and Gradients

Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another. A vast number of methods have been developed for this strategy. While most method designs a more efficient…

Machine Learning · Computer Science 2022-03-22 Yen-Chang Hsu, James Smith, Yilin Shen, Zsolt Kira, Hongxia Jin

Learning Interpretation with Explainable Knowledge Distillation

Knowledge Distillation (KD) has been considered as a key solution in model compression and acceleration in recent years. In KD, a small student model is generally trained from a large teacher model by minimizing the divergence between the…

Machine Learning · Computer Science 2021-11-16 Raed Alharbi, Minh N. Vu, My T. Thai

Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification

Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Zhengda Li

Adaptive Teaching with Shared Classifier for Knowledge Distillation

Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Jaeyeon Jang, Young-Ik Kim, Jisu Lim, Hyeonseong Lee

Learning Efficient Detector with Semi-supervised Adaptive Distillation

Knowledge Distillation (KD) has been used in image classification for model compression. However, rare studies apply this technology on single-stage object detectors. Focal loss shows that the accumulated errors of easily-classified samples…

Computer Vision and Pattern Recognition · Computer Science 2019-01-15 Shitao Tang, Litong Feng, Wenqi Shao, Zhanghui Kuang, Wei Zhang, Yimin Chen

Swapped Logit Distillation via Bi-level Teacher Alignment

Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student). It has been mainstream that the teacher directly transfers knowledge to the student with its…

Machine Learning · Computer Science 2025-05-26 Stephen Ekaputra Limantoro, Jhe-Hao Lin, Chih-Yu Wang, Yi-Lung Tsai, Hong-Han Shuai, Ching-Chun Huang, Wen-Huang Cheng

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto…

Machine Learning · Computer Science 2026-05-12 Ejafa Bassam, Dawei Zhu, Kaigui Bian

AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

Knowledge distillation (KD) aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model. In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Fadi Boutros, Vitomir Štruc, Naser Damer

Improving Knowledge Distillation with Teacher's Explanation

Knowledge distillation (KD) improves the performance of a low-complexity student model with the help of a more powerful teacher. The teacher in KD is a black-box model, imparting knowledge to the student only through its predictions. This…

Machine Learning · Computer Science 2023-10-05 Sayantan Chowdhury, Ben Liang, Ali Tizghadam, Ilijc Albanese