Related papers: CILDA: Contrastive Data Augmentation using Interme…

Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective

Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD…

Computation and Language · Computer Science · 2023-02-06 · Jongwoo Ko, Seungjoon Park, Minchan Jeong, Sukjin Hong, Euijai Ahn, Du-Seong Chang, Se-Young Yun

How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding

Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge of a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications,…

Computation and Language · Computer Science · 2021-09-21 · Tianda Li, Ahmad Rashid, Aref Jafari, Pranav Sharma, Ali Ghodsi, Mehdi Rezagholizadeh

Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation

Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model. Since the teacher model perceives data in a way different from humans, existing KD methods only distill…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-22 · Jiawei Liang, Siyuan Liang, Aishan Liu, Ke Ma, Jingzhi Li, Xiaochun Cao

MixKD: Towards Efficient Distillation of Large-scale Language Models

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their…

Computation and Language · Computer Science · 2021-03-18 · Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science · 2025-04-21 · Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

Preparing Lessons: Improve Knowledge Distillation with Better Supervision

Knowledge distillation (KD) is widely used for training a compact model with the supervision of another large model, which could effectively improve the performance. Previous methods mainly focus on two aspects: 1) training the student to…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-27 · Tiancheng Wen, Shenqi Lai, Xueming Qian

Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition

The teacher-free online Knowledge Distillation (KD) aims to train an ensemble of multiple student models collaboratively and distill knowledge from each other. Although existing online KD methods achieve desirable performance, they often…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-28 · Chuanguang Yang, Zhulin An, Helong Zhou, Fuzhen Zhuang, Yongjun Xu, Qian Zhan

Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration

Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-18 · Yunshuai Zhou, Junbo Qiao, Jincheng Liao, Wei Li, Simiao Li, Jiao Xie, Yunhang Shen, Jie Hu, Shaohui Lin

Improved knowledge distillation by utilizing backward pass knowledge in neural networks

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science · 2023-01-31 · Aref Jafari, Mehdi Rezagholizadeh, Ali Ghodsi

Distilling Invariant Representations with Dual Augmentation

Knowledge distillation (KD) has been widely used to transfer knowledge from large, accurate models (teachers) to smaller, efficient ones (students). Recent methods have explored enforcing consistency by incorporating causal interpretations…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-17 · Nikolaos Giakoumoglou, Tania Stathaki

Preview-based Category Contrastive Learning for Knowledge Distillation

Knowledge distillation is a mainstream algorithm in model compression by transferring knowledge from the larger model (teacher) to the smaller model (student) to improve the performance of student. Despite many efforts, existing methods…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-21 · Muhe Ding, Jianlong Wu, Xue Dong, Xiaojie Li, Pengda Qin, Tian Gan, Liqiang Nie

Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition

Recently, the advance in deep learning has brought a considerable improvement in the end-to-end speech recognition field, simplifying the traditional pipeline while producing promising results. Among the end-to-end models, the connectionist…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-11-29 · Ji Won Yoon, Beom Jun Woo, Sunghwan Ahn, Hyeonseung Lee, Nam Soo Kim

Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization

Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods…

Machine Learning · Computer Science · 2022-12-13 · Aref Jafari, Ivan Kobyzev, Mehdi Rezagholizadeh, Pascal Poupart, Ali Ghodsi

Discriminative and Consistent Representation Distillation

Knowledge Distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. While contrastive learning has shown promise in self-supervised learning by creating discriminative representations, its…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-14 · Nikolaos Giakoumoglou, Tania Stathaki

MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation

The advent of large pre-trained language models has given rise to rapid progress in the field of Natural Language Processing (NLP). While the performance of these models on standard benchmarks has scaled with size, compression techniques…

Computation and Language · Computer Science · 2021-05-14 · Ahmad Rashid, Vasileios Lioutas, Mehdi Rezagholizadeh

Multi-level Knowledge Distillation via Knowledge Alignment and Correlation

Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We demonstrate that…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-07 · Fei Ding, Yin Yang, Hongxin Hu, Venkat Krovi, Feng Luo

Towards Effective Data-Free Knowledge Distillation via Diverse Diffusion Augmentation

Data-free knowledge distillation (DFKD) has emerged as a pivotal technique in the domain of model compression, substantially reducing the dependency on the original training data. Nonetheless, conventional DFKD methods that employ…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-24 · Muquan Li, Dongyang Zhang, Tao He, Xiurui Xie, Yuan-Fang Li, Ke Qin

Understanding the Role of Mixup in Knowledge Distillation: An Empirical Study

Mixup is a popular data augmentation technique based on creating new samples by linear interpolation between two given data samples, to improve both the generalization and robustness of the trained model. Knowledge distillation (KD), on the…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-10 · Hongjun Choi, Eun Som Jeon, Ankita Shukla, Pavan Turaga

Residual Knowledge Distillation

Knowledge distillation (KD) is one of the most potent ways for model compression. The key idea is to transfer the knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance…

Machine Learning · Computer Science · 2020-02-24 · Mengya Gao, Yujun Shen, Quanquan Li, Chen Change Loy

RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation

Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models) especially over large pre-trained language models. However, intermediate layer distillation…

Computation and Language · Computer Science · 2021-10-05 · Md Akmal Haidar, Nithin Anchuri, Mehdi Rezagholizadeh, Abbas Ghaddar, Philippe Langlais, Pascal Poupart