Related papers: Decoupled Knowledge Distillation

Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

In the history of knowledge distillation, the focus has once shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which…

Machine Learning · Computer Science 2025-12-05 Bowen Zheng, Ran Cheng

Decoupling Dark Knowledge via Block-wise Logit Distillation for Feature-level Alignment

Knowledge Distillation (KD), a learning manner with a larger teacher network guiding a smaller student network, transfers dark knowledge from the teacher to the student via logits or intermediate features, with the aim of producing a…

Machine Learning · Computer Science 2024-12-04 Chengting Yu, Fengzhao Zhang, Ruizhe Chen, Aili Wang, Zuozhu Liu, Shurun Tan, Er-Ping Li

Class-aware Information for Logit-based Knowledge Distillation

Knowledge distillation aims to transfer knowledge to the student model by utilizing the predictions/features of the teacher model, and feature-based distillation has recently shown its superiority over logit-based distillation. However, due…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Shuoxi Zhang, Hanpeng Liu, John E. Hopcroft, Kun He

Grouped Knowledge Distillation for Deep Face Recognition

Compared with the feature-based distillation methods, logits distillation can liberalize the requirements of consistent feature dimension between teacher and student networks, while the performance is deemed inferior in face recognition.…

Computer Vision and Pattern Recognition · Computer Science 2023-04-11 Weisong Zhao, Xiangyu Zhu, Kaiwen Guo, Xiao-Yu Zhang, Zhen Lei

TopKD: Top-scaled Knowledge Distillation

Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Qi Wang, Jinjia Zhou

Scale Decoupled Distillation

Logit knowledge distillation attracts increasing attention due to its practicality in recent studies. However, it often suffers inferior performance compared to the feature knowledge distillation. In this paper, we argue that existing…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Shicai Wei, Chunbo Luo, Yang Luo

A Closer Look at Knowledge Distillation with Features, Logits, and Gradients

Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another. A vast number of methods have been developed for this strategy. While most method designs a more efficient…

Machine Learning · Computer Science 2022-03-22 Yen-Chang Hsu, James Smith, Yilin Shen, Zsolt Kira, Hongxia Jin

Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and…

Computer Vision and Pattern Recognition · Computer Science 2019-09-10 Mengya Gao, Yujun Shen, Quanquan Li, Junjie Yan, Liang Wan, Dahua Lin, Chen Change Loy, Xiaoou Tang

Neural Collapse Inspired Knowledge Distillation

Existing knowledge distillation (KD) methods have demonstrated their ability in achieving student network performance on par with their teachers. However, the knowledge gap between the teacher and student remains significant and may hinder…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Shuoxi Zhang, Zijian Song, Kun He

DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer

Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

Progressive Class-level Distillation

In knowledge distillation (KD), logit distillation (LD) aims to transfer class-level knowledge from a more powerful teacher network to a small student model via accurate teacher-student alignment at the logits level. Since high-confidence…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Jiayan Li, Jun Li, Zhourui Zhang, Jianhua Xu

Decomposed Knowledge Distillation for Class-Incremental Semantic Segmentation

Class-incremental semantic segmentation (CISS) labels each pixel of an image with a corresponding object/stuff class continually. To this end, it is crucial to learn novel classes incrementally without forgetting previously learned…

Computer Vision and Pattern Recognition · Computer Science 2022-10-13 Donghyeon Baek, Youngmin Oh, Sanghoon Lee, Junghyup Lee, Bumsub Ham

Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition to predicted probabilities from logits would…

Computer Vision and Pattern Recognition · Computer Science 2026-04-08 Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An

Residual Knowledge Distillation

Knowledge distillation (KD) is one of the most potent ways for model compression. The key idea is to transfer the knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance…

Machine Learning · Computer Science 2020-02-24 Mengya Gao, Yujun Shen, Quanquan Li, Chen Change Loy

What is Lost in Knowledge Distillation?

Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks could be costly. Model compression techniques, such as, knowledge distillation (KD), have been proposed to address the issue;…

Computation and Language · Computer Science 2023-11-08 Manas Mohanty, Tanya Roosta, Peyman Passban

CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective

In this paper, we propose a simple yet effective contrastive knowledge distillation framework that achieves sample-wise logit alignment while preserving semantic consistency. Conventional knowledge distillation approaches exhibit…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Wencheng Zhu, Xin Zhou, Pengfei Zhu, Yu Wang, Qinghua Hu

Localization Distillation for Object Detection

Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the prediction logits due to its inefficiency in distilling the localization information. In this paper, we investigate…

Computer Vision and Pattern Recognition · Computer Science 2022-12-09 Zhaohui Zheng, Rongguang Ye, Qibin Hou, Dongwei Ren, Ping Wang, Wangmeng Zuo, Ming-Ming Cheng