Related papers: Efficient Sub-structured Knowledge Distillation

Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the…

Computation and Language · Computer Science 2021-06-03 Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use…

Computation and Language · Computer Science 2023-02-02 Chenglong Wang, Yi Lu, Yongyu Mu, Yimin Hu, Tong Xiao, Jingbo Zhu

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

Structured Knowledge Distillation for Dense Prediction

In this work, we consider transferring the structure information from large networks to compact ones for dense prediction tasks in computer vision. Previous knowledge distillation strategies used for dense prediction tasks often directly…

Computer Vision and Pattern Recognition · Computer Science 2020-06-16 Yifan Liu, Changyong Shun, Jingdong Wang, Chunhua Shen

Efficient Knowledge Distillation from Model Checkpoints

Knowledge distillation is an effective approach to learn compact models (students) with the supervision of large and strong models (teachers). As empirically there exists a strong correlation between the performance of teacher and student…

Machine Learning · Computer Science 2022-10-13 Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang

Knowledge Distillation with Deep Supervision

Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge from a pre-trained cumbersome teacher model. However, in the traditional knowledge distillation, teacher predictions are only…

Machine Learning · Computer Science 2023-05-26 Shiya Luo, Defang Chen, Can Wang

Learning Student-Friendly Teacher Networks for Knowledge Distillation

We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student. Contrary to most of the existing methods that rely on effective training of student models given pretrained…

Machine Learning · Computer Science 2022-01-25 Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Dae Sin Kim, Bohyung Han

Distilling Model Knowledge

Top-performing machine learning systems, such as deep neural networks, large ensembles and complex probabilistic graphical models, can be expensive to store, slow to evaluate and hard to integrate into larger systems. Ideally, we would like…

Machine Learning · Statistics 2015-10-09 George Papamakarios

Knowledge Distillation with Training Wheels

Knowledge distillation is used, in generative language modeling, to train a smaller student model using the help of a larger teacher model, resulting in improved capabilities for the student model. In this paper, we formulate a more general…

Computation and Language · Computer Science 2025-02-26 Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy

Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces

Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address…

Machine Learning · Computer Science 2026-01-12 Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon

Knowledge Distillation Meets Self-Supervision

Knowledge distillation, which involves extracting the "dark knowledge" from a teacher network to guide the learning of a student network, has emerged as an important technique for model compression and transfer learning. Unlike previous…

Computer Vision and Pattern Recognition · Computer Science 2020-07-14 Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy

Distilling Knowledge for Search-based Structured Prediction

Many natural language processing tasks can be modeled into structured prediction and solved as a search problem. In this paper, we distill an ensemble of multiple models trained with different initialization into a single model. In addition…

Computation and Language · Computer Science 2018-05-30 Yijia Liu, Wanxiang Che, Huaipeng Zhao, Bing Qin, Ting Liu

Reinforced Multi-Teacher Selection for Knowledge Distillation

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation…

Computation and Language · Computer Science 2020-12-15 Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

Knowledge Distillation from A Stronger Teacher

Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong or competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to…

Computer Vision and Pattern Recognition · Computer Science 2022-12-29 Tao Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources to train such models remain an issue. Knowledge distillation alleviates this problem by learning a…

Computation and Language · Computer Science 2020-05-04 Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

A New Training Framework for Deep Neural Network

Knowledge distillation is the process of transferring the knowledge from a large model to a small model. In this process, the small model learns the generalization ability of the large model and retains the performance close to that of the…

Machine Learning · Computer Science 2021-03-26 Zhenyan Hou, Wenxuan Fan

Revisiting Self-Distillation

Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture,…

Machine Learning · Computer Science 2022-06-20 Minh Pham, Minsu Cho, Ameya Joshi, Chinmay Hegde

Knowledge Distillation as Semiparametric Inference

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to…

Machine Learning · Statistics 2021-04-21 Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed…

Machine Learning · Computer Science 2020-11-24 Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel