Related papers: Subclass Distillation

Teacher's pet: understanding and mitigating biases in distillation

Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the…

Machine Learning · Computer Science 2021-07-09 Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Does Knowledge Distillation Really Work?

Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not…

Machine Learning · Computer Science 2021-12-07 Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, Andrew Gordon Wilson

Why distillation helps: a statistical perspective

Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has…

Machine Learning · Computer Science 2020-05-22 Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Distilling Double Descent

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are…

Machine Learning · Computer Science 2021-02-16 Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge

Dataset distillation aims to compress training data into fewer examples via a teacher, from which a student can learn effectively. While its success is often attributed to structure in the data, modern neural networks also memorize specific…

Machine Learning · Computer Science 2026-02-23 Freya Behrens, Lenka Zdeborová

Knowledge Distillation as Semiparametric Inference

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to…

Machine Learning · Statistics 2021-04-21 Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

Revisiting Self-Distillation

Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture,…

Machine Learning · Computer Science 2022-06-20 Minh Pham, Minsu Cho, Ameya Joshi, Chinmay Hegde

Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed…

Machine Learning · Computer Science 2020-11-24 Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel

Similarity-Preserving Knowledge Distillation

Knowledge distillation is a widely applicable technique for training a student neural network under the guidance of a trained teacher network. For example, in neural network compression, a high-capacity teacher is distilled to train a…

Computer Vision and Pattern Recognition · Computer Science 2019-08-05 Frederick Tung, Greg Mori

Distillation from heterogeneous unlabeled collections

Compressing deep networks is essential to expand their range of applications to constrained settings. The need for compression however often arises long after the model was trained, when the original data might no longer be available. On…

Machine Learning · Computer Science 2022-01-19 Jean-Michel Begon, Pierre Geurts

Student Network Learning via Evolutionary Knowledge Distillation

Knowledge distillation provides an effective way to transfer knowledge via teacher-student learning, where most existing distillation approaches apply a fixed pre-trained model as teacher to supervise the learning of student network. This…

Machine Learning · Computer Science 2021-03-26 Kangkai Zhang, Chunhui Zhang, Shikun Li, Dan Zeng, Shiming Ge

Prune Your Model Before Distill It

Knowledge distillation transfers the knowledge from a cumbersome teacher to a small student. Recent results suggest that the student-friendly teacher is more appropriate to distill since it provides more transferable knowledge. In this…

Machine Learning · Computer Science 2022-07-26 Jinhyuk Park, Albert No

Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces

Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address…

Machine Learning · Computer Science 2026-01-12 Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Language models can transfer hidden biases during distillation. For example, a teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. This surprising phenomenon is…

Machine Learning · Computer Science 2026-03-06 Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox

What Knowledge Gets Distilled in Knowledge Distillation?

Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee

Learning Student-Friendly Teacher Networks for Knowledge Distillation

We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student. Contrary to most of the existing methods that rely on effective training of student models given pretrained…

Machine Learning · Computer Science 2022-01-25 Dae Young Park, Moon-Hyun Cha, Changwook Jeong, Dae Sin Kim, Bohyung Han

Knowledge Distillation with the Reused Teacher Classifier

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

Representation Consolidation for Training Expert Students

Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher. A more useful goal than emulation, yet under-explored, is for the student to learn feature representations that…

Computer Vision and Pattern Recognition · Computer Science 2021-07-19 Zhizhong Li, Avinash Ravichandran, Charless Fowlkes, Marzia Polito, Rahul Bhotika, Stefano Soatto

Reinforced Multi-Teacher Selection for Knowledge Distillation

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation…

Computation and Language · Computer Science 2020-12-15 Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use…

Computation and Language · Computer Science 2023-02-02 Chenglong Wang, Yi Lu, Yongyu Mu, Yimin Hu, Tong Xiao, Jingbo Zhu