Related papers: Efficient Knowledge Distillation from Model Checkp…

Efficient Knowledge Distillation via Curriculum Extraction

Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages~\citep{Hinton2015DistillingTK}. While the standard one-shot approach to…

Machine Learning · Computer Science 2025-03-25 Shivam Gupta , Sushrut Karmalkar

Rethinking the Knowledge Distillation From the Perspective of Model Calibration

Recent years have witnessed dramatically improvements in the knowledge distillation, which can generate a compact student model for better efficiency while retaining the model effectiveness of the teacher model. Previous studies find that:…

Computer Vision and Pattern Recognition · Computer Science 2021-11-04 Lehan Yang , Jincen Song

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use…

Computation and Language · Computer Science 2023-02-02 Chenglong Wang , Yi Lu , Yongyu Mu , Yimin Hu , Tong Xiao , Jingbo Zhu

Distilling Knowledge via Intermediate Classifiers

The crux of knowledge distillation is to effectively train a resource-limited student model with the guide of a pre-trained larger teacher model. However, when there is a large difference between the model complexities of teacher and…

Machine Learning · Computer Science 2021-06-01 Aryan Asadian , Amirali Salehi-Abari

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen , Jian-Ping Mei , Hailin Zhang , Can Wang , Yan Feng , Chun Chen

On the Efficacy of Knowledge Distillation

In this paper, we present a thorough evaluation of the efficacy of knowledge distillation and its dependence on student and teacher architectures. Starting with the observation that more accurate teachers often don't make good teachers, we…

Machine Learning · Computer Science 2019-10-04 Jang Hyun Cho , Bharath Hariharan

Does Knowledge Distillation Really Work?

Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not…

Machine Learning · Computer Science 2021-12-07 Samuel Stanton , Pavel Izmailov , Polina Kirichenko , Alexander A. Alemi , Andrew Gordon Wilson

Cooperative Knowledge Distillation: A Learner Agnostic Approach

Knowledge distillation is a simple but powerful way to transfer knowledge between a teacher model to a student model. Existing work suffers from at least one of the following key limitations in terms of direction and scope of transfer which…

Machine Learning · Computer Science 2024-02-12 Michael Livanos , Ian Davidson , Stephen Wong

Revisiting Self-Distillation

Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture,…

Machine Learning · Computer Science 2022-06-20 Minh Pham , Minsu Cho , Ameya Joshi , Chinmay Hegde

Knowledge Distillation via Weighted Ensemble of Teaching Assistants

Knowledge distillation in machine learning is the process of transferring knowledge from a large model called the teacher to a smaller model called the student. Knowledge distillation is one of the techniques to compress the large network…

Machine Learning · Computer Science 2022-06-27 Durga Prasad Ganta , Himel Das Gupta , Victor S. Sheng

Knowledge Distillation with Training Wheels

Knowledge distillation is used, in generative language modeling, to train a smaller student model using the help of a larger teacher model, resulting in improved capabilities for the student model. In this paper, we formulate a more general…

Computation and Language · Computer Science 2025-02-26 Guanlin Liu , Anand Ramachandran , Tanmay Gangwani , Yan Fu , Abhinav Sethy

Can a student Large Language Model perform as well as it's teacher?

The burgeoning complexity of contemporary deep learning models, while achieving unparalleled accuracy, has inadvertently introduced deployment challenges in resource-constrained environments. Knowledge distillation, a technique aiming to…

Machine Learning · Computer Science 2023-10-05 Sia Gholami , Marwan Omar

Teacher's pet: understanding and mitigating biases in distillation

Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the…

Machine Learning · Computer Science 2021-07-09 Michal Lukasik , Srinadh Bhojanapalli , Aditya Krishna Menon , Sanjiv Kumar

Knowledge Distillation from A Stronger Teacher

Unlike existing knowledge distillation methods focus on the baseline settings, where the teacher models and training strategies are not that strong and competing as state-of-the-art approaches, this paper presents a method dubbed DIST to…

Computer Vision and Pattern Recognition · Computer Science 2022-12-29 Tao Huang , Shan You , Fei Wang , Chen Qian , Chang Xu

Improved knowledge distillation by utilizing backward pass knowledge in neural networks

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science 2023-01-31 Aref Jafari , Mehdi Rezagholizadeh , Ali Ghodsi

Hard Gate Knowledge Distillation -- Leverage Calibration for Robust and Reliable Language Model

In knowledge distillation, a student model is trained with supervisions from both knowledge from a teacher and observations drawn from a training data distribution. Knowledge of a teacher is considered a subject that holds inter-class…

Computation and Language · Computer Science 2022-10-25 Dongkyu Lee , Zhiliang Tian , Yingxiu Zhao , Ka Chun Cheung , Nevin L. Zhang

Learn From the Past: Experience Ensemble Knowledge Distillation

Traditional knowledge distillation transfers "dark knowledge" of a pre-trained teacher network to a student network, and ignores the knowledge in the training process of the teacher, which we call teacher's experience. However, in realistic…

Computer Vision and Pattern Recognition · Computer Science 2022-02-28 Chaofei Wang , Shaowei Zhang , Shiji Song , Gao Huang

Learning Student-Friendly Teacher Networks for Knowledge Distillation

We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student. Contrary to most of the existing methods that rely on effective training of student models given pretrained…

Machine Learning · Computer Science 2022-01-25 Dae Young Park , Moon-Hyun Cha , Changwook Jeong , Dae Sin Kim , Bohyung Han

Student Network Learning via Evolutionary Knowledge Distillation

Knowledge distillation provides an effective way to transfer knowledge via teacher-student learning, where most existing distillation approaches apply a fixed pre-trained model as teacher to supervise the learning of student network. This…

Machine Learning · Computer Science 2021-03-26 Kangkai Zhang , Chunhui Zhang , Shikun Li , Dan Zeng , Shiming Ge

Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Training a small student network with the guidance of a larger teacher network is an effective way to promote the performance of the student. Despite the different types, the guided knowledge used to distill is always kept unchanged for…

Computer Vision and Pattern Recognition · Computer Science 2021-04-01 Jiangfan Han , Mengya Gao , Yujie Wang , Quanquan Li , Hongsheng Li , Xiaogang Wang