Related papers: Instance Temperature Knowledge Distillation

Dynamic Temperature Knowledge Distillation

Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD). Traditional approaches often employ a static temperature throughout the KD process, which fails to address the nuanced complexities…

Machine Learning · Computer Science 2024-04-22 Yukang Wei , Yu Bai

Preparing Lessons: Improve Knowledge Distillation with Better Supervision

Knowledge distillation (KD) is widely used for training a compact model with the supervision of another large model, which could effectively improve the performance. Previous methods mainly focus on two aspects: 1) training the student to…

Computer Vision and Pattern Recognition · Computer Science 2020-07-27 Tiancheng Wen , Shenqi Lai , Xueming Qian

Curriculum Temperature for Knowledge Distillation

Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy…

Computer Vision and Pattern Recognition · Computer Science 2022-12-27 Zheng Li , Xiang Li , Lingfeng Yang , Borui Zhao , Renjie Song , Lei Luo , Jun Li , Jian Yang

Knowledge Distillation Layer that Lets the Student Decide

Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Ada Gorgun , Yeti Z. Gurbuz , A. Aydin Alatan

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Knowledge distillation (KD) is a new method for transferring knowledge of a structure under training to another one. The typical application of KD is in the form of learning a small model (named as a student) by soft labels produced by a…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Sajjad Abbasi , Mohsen Hajabdollahi , Nader Karimi , Shadrokh Samavi

Learning to Teach with Student Feedback

Knowledge distillation (KD) has gained much attention due to its effectiveness in compressing large-scale pre-trained models. In typical KD methods, the small student model is trained to match the soft targets generated by the big teacher…

Machine Learning · Computer Science 2021-09-13 Yitao Liu , Tianxiang Sun , Xipeng Qiu , Xuanjing Huang

Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected…

Computation and Language · Computer Science 2021-09-24 Lei Li , Yankai Lin , Shuhuai Ren , Peng Li , Jie Zhou , Xu Sun

Relational Knowledge Distillation

Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. Previous approaches can be expressed as a form of training the student to mimic output…

Computer Vision and Pattern Recognition · Computer Science 2019-05-02 Wonpyo Park , Dongju Kim , Yan Lu , Minsu Cho

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science 2021-03-02 Jiaxi Tang , Rakesh Shivanna , Zhe Zhao , Dong Lin , Anima Singh , Ed H. Chi , Sagar Jain

Interactive Knowledge Distillation

Knowledge distillation is a standard teacher-student learning framework to train a light-weight student network under the guidance of a well-trained large teacher network. As an effective teaching strategy, interactive teaching has been…

Computer Vision and Pattern Recognition · Computer Science 2021-04-16 Shipeng Fu , Zhen Li , Jun Xu , Ming-Ming Cheng , Zitao Liu , Xiaomin Yang

Dynamic Temperature Scheduler for Knowledge Distillation

Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature…

Machine Learning · Computer Science 2025-11-19 Sibgat Ul Islam , Jawad Ibn Ahad , Fuad Rahman , Mohammad Ruhul Amin , Nabeel Mohammed , Shafin Rahman

LLM-Oriented Token-Adaptive Knowledge Distillation

Knowledge distillation (KD) is a key technique for compressing large-scale language models (LLMs), yet prevailing logit-based methods typically employ static strategies that are misaligned with the dynamic learning process of student…

Computation and Language · Computer Science 2025-10-14 Xurong Xie , Zhucun Xue , Jiafu Wu , Jian Li , Yabiao Wang , Xiaobin Hu , Yong Liu , Jiangning Zhang

Locally Linear Region Knowledge Distillation

Knowledge distillation (KD) is an effective technique to transfer knowledge from one neural network (teacher) to another (student), thus improving the performance of the student. To make the student better mimic the behavior of the teacher,…

Machine Learning · Computer Science 2020-10-20 Xiang Deng , Zhongfei Zhang

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang , Lei Li , Xu Sun

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang , Junhao Song , Xudong Han , Ziqian Bi , Tianyang Wang , Chia Xin Liang , Xinyuan Song , Yichao Zhang , Qian Niu , Benji Peng , Keyu Chen , Ming Liu

Residual Knowledge Distillation

Knowledge distillation (KD) is one of the most potent ways for model compression. The key idea is to transfer the knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance…

Machine Learning · Computer Science 2020-02-24 Mengya Gao , Yujun Shen , Quanquan Li , Chen Change Loy

Role-Wise Data Augmentation for Knowledge Distillation

Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model (the teacher) into another model (the student), where typically, the teacher has a greater capacity…

Machine Learning · Computer Science 2020-04-21 Jie Fu , Xue Geng , Zhijian Duan , Bohan Zhuang , Xingdi Yuan , Adam Trischler , Jie Lin , Chris Pal , Hao Dong

Can Students Beyond The Teacher? Distilling Knowledge from Teacher's Bias

Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance its performance. Existing methods often assume that the student model is inherently…

Computer Vision and Pattern Recognition · Computer Science 2024-12-16 Jianhua Zhang , Yi Gao , Ruyu Liu , Xu Cheng , Houxiang Zhang , Shengyong Chen

Knowledge Condensation Distillation

Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher network to strengthen a smaller student. Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student. However,…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Chenxin Li , Mingbao Lin , Zhiyuan Ding , Nie Lin , Yihong Zhuang , Yue Huang , Xinghao Ding , Liujuan Cao

Dynamic Rectification Knowledge Distillation

Knowledge Distillation is a technique which aims to utilize dark knowledge to compress and transfer information from a vast, well-trained neural network (teacher model) to a smaller, less capable neural network (student model) with improved…

Computer Vision and Pattern Recognition · Computer Science 2022-01-28 Fahad Rahman Amik , Ahnaf Ismat Tasin , Silvia Ahmed , M. M. Lutfe Elahi , Nabeel Mohammed