Related papers: Generative Model-based Feature Knowledge Distillat…

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang , Junhao Song , Xudong Han , Ziqian Bi , Tianyang Wang , Chia Xin Liang , Xinyuan Song , Yichao Zhang , Qian Niu , Benji Peng , Keyu Chen , Ming Liu

Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under…

Computer Vision and Pattern Recognition · Computer Science 2026-03-04 Chonghua Lv , Dong Zhao , Shuang Wang , Dou Quan , Ning Huyan , Nicu Sebe , Zhun Zhong

Sample-level Adaptive Knowledge Distillation for Action Recognition

Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Ping Li , Chenhao Ping , Wenxiao Wang , Mingli Song

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz , Elahe Arani , Bahram Zonooz

Attention-based Knowledge Distillation in Multi-attention Tasks: The Impact of a DCT-driven Loss

Knowledge Distillation (KD) is a strategy for the definition of a set of transferability gangways to improve the efficiency of Convolutional Neural Networks. Feature-based Knowledge Distillation is a subfield of KD that relies on…

Computer Vision and Pattern Recognition · Computer Science 2022-06-07 Alejandro López-Cifuentes , Marcos Escudero-Viñolo , Jesús Bescós , Juan C. SanMiguel

Gradient Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning…

Computation and Language · Computer Science 2022-11-03 Lean Wang , Lei Li , Xu Sun

Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection

In video understanding, most cross-modal knowledge distillation (KD) methods are tailored for classification tasks, focusing on the discriminative representation of the trimmed videos. However, action detection requires not only…

Computer Vision and Pattern Recognition · Computer Science 2021-08-10 Rui Dai , Srijan Das , Francois Bremond

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from…

Machine Learning · Computer Science 2024-01-18 Rishabh Agarwal , Nino Vieillard , Yongchao Zhou , Piotr Stanczyk , Sabela Ramos , Matthieu Geist , Olivier Bachem

Gradient-Guided Knowledge Distillation for Object Detectors

Deep learning models have demonstrated remarkable success in object detection, yet their complexity and computational intensity pose a barrier to deploying them in real-world applications (e.g., self-driving perception). Knowledge…

Computer Vision and Pattern Recognition · Computer Science 2023-03-09 Qizhen Lan , Qing Tian

KD-MRI: A knowledge distillation framework for image reconstruction and image restoration in MRI workflow

Deep learning networks are being developed in every stage of the MRI workflow and have provided state-of-the-art results. However, this has come at the cost of increased computation requirement and storage. Hence, replacing the networks…

Image and Video Processing · Electrical Eng. & Systems 2020-04-14 Balamurali Murugesan , Sricharan Vijayarangan , Kaushik Sarveswaran , Keerthi Ram , Mohanasankar Sivaprakasam

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Knowledge distillation (KD) is a new method for transferring knowledge of a structure under training to another one. The typical application of KD is in the form of learning a small model (named as a student) by soft labels produced by a…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Sajjad Abbasi , Mohsen Hajabdollahi , Nader Karimi , Shadrokh Samavi

Knowledge Distillation Performs Partial Variance Reduction

Knowledge distillation is a popular approach for enhancing the performance of ''student'' models, with lower representational capacity, by taking advantage of more powerful ''teacher'' models. Despite its apparent simplicity and widespread…

Machine Learning · Computer Science 2023-12-12 Mher Safaryan , Alexandra Peste , Dan Alistarh

Understanding and Improving Knowledge Distillation

Knowledge Distillation (KD) is a model-agnostic technique to improve model quality while having a fixed capacity budget. It is a commonly used technique for model compression, where a larger capacity teacher model with better quality is…

Machine Learning · Computer Science 2021-03-02 Jiaxi Tang , Rakesh Shivanna , Zhe Zhao , Dong Lin , Anima Singh , Ed H. Chi , Sagar Jain

Knowledge Distillation Layer that Lets the Student Decide

Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Ada Gorgun , Yeti Z. Gurbuz , A. Aydin Alatan

Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented -- enabling smaller student models to…

Machine Learning · Computer Science 2026-01-16 Sungmin Cha , Kyunghyun Cho

Heterogeneous Knowledge Distillation using Information Flow Modeling

Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last…

Computer Vision and Pattern Recognition · Computer Science 2020-05-05 Nikolaos Passalis , Maria Tzelepi , Anastasios Tefas

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more…

Computer Vision and Pattern Recognition · Computer Science 2021-06-18 Lin Wang , Kuk-Jin Yoon

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and…

Computer Vision and Pattern Recognition · Computer Science 2019-09-10 Mengya Gao , Yujun Shen , Quanquan Li , Junjie Yan , Liang Wan , Dahua Lin , Chen Change Loy , Xiaoou Tang

Graph-based Knowledge Distillation by Multi-head Attention Network

Knowledge distillation (KD) is a technique to derive optimal performance from a small student network (SN) by distilling knowledge of a large teacher network (TN) and transferring the distilled knowledge to the small SN. Since a role of…

Machine Learning · Computer Science 2019-07-10 Seunghyun Lee , Byung Cheol Song

An Empirical Study of Knowledge Distillation for Code Understanding Tasks

Pre-trained language models (PLMs) have emerged as powerful tools for code understanding. However, deploying these PLMs in large-scale applications faces practical challenges due to their computational intensity and inference latency.…

Software Engineering · Computer Science 2025-08-22 Ruiqi Wang , Zezhou Yang , Cuiyun Gao , Xin Xia , Qing Liao