Related papers: Meta Knowledge Distillation

Preparing Lessons: Improve Knowledge Distillation with Better Supervision

Knowledge distillation (KD) is widely used for training a compact model with the supervision of another large model, which could effectively improve the performance. Previous methods mainly focus on two aspects: 1) training the student to…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-27 · Tiancheng Wen, Shenqi Lai, Xueming Qian

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science · 2021-03-02 · Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, Sagar Jain

MoKD: Multi-Task Optimization for Knowledge Distillation

Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in Knowledge Distillation (KD) are: 1) balancing learning…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-05 · Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi

BERT Learns to Teach: Knowledge Distillation with Meta Learning

We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn…

Machine Learning · Computer Science · 2022-04-05 · Wangchunshu Zhou, Canwen Xu, Julian McAuley

Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on…

Computation and Language · Computer Science · 2026-05-12 · Xuewen Zhang, Haixiao Zhang, Xinlong Huang

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge hides in KD is still not clear, which may hinder the development of KD. In this work, we first unravel…

Computation and Language · Computer Science · 2024-07-18 · Songming Zhang, Yunlong Liang, Shuaibo Wang, Wenjuan Han, Jian Liu, Jinan Xu, Yufeng Chen

HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large…

Machine Learning · Computer Science · 2025-12-11 · Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

Towards Comparable Knowledge Distillation in Semantic Image Segmentation

Knowledge Distillation (KD) is one proposed solution to large model sizes and slow inference speed in semantic segmentation. In our research we identify 25 proposed distillation loss terms from 14 publications in the last 4 years.…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-08 · Onno Niemann, Christopher Vox, Thorben Werner

Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications.…

Computation and Language · Computer Science · 2022-11-03 · Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang

On the Generalization vs Fidelity Paradox in Knowledge Distillation

Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the…

Computation and Language · Computer Science · 2025-08-05 · Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty

MixKD: Towards Efficient Distillation of Large-scale Language Models

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their…

Computation and Language · Computer Science · 2021-03-18 · Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin

Memorization Dynamics in Knowledge Distillation for Language Models

Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond…

Computation and Language · Computer Science · 2026-01-23 · Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano

Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected…

Computation and Language · Computer Science · 2021-09-24 · Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning

Multi-Teacher knowledge distillation provides students with additional supervision from multiple pre-trained teachers with diverse information sources. Most existing methods explore different weighting strategies to obtain a powerful…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-13 · Hailin Zhang, Defang Chen, Can Wang

MemKD: Memory-Discrepancy Knowledge Distillation for Efficient Time Series Classification

Deep learning models, particularly recurrent neural networks and their variants, such as long short-term memory, have significantly advanced time series data analysis. These models capture complex, sequential patterns in time series,…

Machine Learning · Computer Science · 2026-01-12 · Nilushika Udayangani, Kishor Nandakishor, Marimuthu Palaniswami

Improving Knowledge Distillation with Teacher's Explanation

Knowledge distillation (KD) improves the performance of a low-complexity student model with the help of a more powerful teacher. The teacher in KD is a black-box model, imparting knowledge to the student only through its predictions. This…

Machine Learning · Computer Science · 2023-10-05 · Sayantan Chowdhury, Ben Liang, Ali Tizghadam, Ilijc Albanese

Knowledge distillation for optimization of quantized deep neural networks

Knowledge distillation (KD) is a very popular method for model size reduction. Recently, the technique is exploited for quantized deep neural networks (QDNNs) training as a way to restore the performance sacrificed by word-length reduction.…

Machine Learning · Computer Science · 2019-10-24 · Sungho Shin, Yoonho Boo, Wonyong Sung

Multi-level Knowledge Distillation via Knowledge Alignment and Correlation

Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We demonstrate that…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-07 · Fei Ding, Yin Yang, Hongxin Hu, Venkat Krovi, Feng Luo

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-18 · Lin Wang, Kuk-Jin Yoon

Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation

The widespread deployment of Large Language Models (LLMs) is hindered by the high computational demands, making knowledge distillation (KD) crucial for developing compact smaller ones. However, the conventional KD methods endure the…

Computation and Language · Computer Science · 2025-02-18 · Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou