Related papers: Predicting Multi-Codebook Vector Quantization Inde…

Compressing Visual-linguistic Model via Knowledge Distillation

Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer-based large VL model into a…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, Zicheng Liu

Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-21 Xiaoyu Yang, Qiujia Li, Chao Zhang, Philip C. Woodland

Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models

Self-supervised pre-training is an effective approach to leveraging a large amount of unlabelled data to reduce word error rates (WERs) of automatic speech recognition (ASR) systems. Since it is impractical to use large pre-trained models…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-03 Xiaoyu Yang, Qiujia Li, Philip C. Woodland

Integrated Multi-Level Knowledge Distillation for Enhanced Speaker Verification

Knowledge distillation (KD) is widely used in audio tasks, such as speaker verification (SV), by transferring knowledge from a well-trained large model (the teacher) to a smaller, more compact model (the student) for efficiency and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Wenhao Yang, Jianguo Wei, Wenhuan Lu, Xugang Lu, Lei Li

Bridging the Gap: Unpacking the Hidden Challenges in Knowledge Distillation for Online Ranking Systems

Knowledge Distillation (KD) is a powerful approach for compressing a large model into a smaller, more efficient model, particularly beneficial for latency-sensitive applications like recommender systems. However, current KD research…

Information Retrieval · Computer Science 2024-08-28 Nikhil Khani, Shuo Yang, Aniruddh Nath, Yang Liu, Pendo Abbo, Li Wei, Shawn Andrews, Maciej Kula, Jarrod Kahn, Zhe Zhao, Lichan Hong, Ed Chi

Revisiting Knowledge Distillation via Label Smoothing Regularization

Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the…

Computer Vision and Pattern Recognition · Computer Science 2021-03-05 Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, Jiashi Feng

Improve Knowledge Distillation via Label Revision and Data Selection

Knowledge distillation (KD) has become a widely used technique in the field of model compression, which aims to transfer knowledge from a large teacher model to a lightweight student model for efficient network development. In addition to…

Machine Learning · Computer Science 2024-04-08 Weichao Lan, Yiu-ming Cheung, Qing Xu, Buhua Liu, Zhikai Hu, Mengke Li, Zhenghua Chen

An Empirical Study of Knowledge Distillation for Code Understanding Tasks

Pre-trained language models (PLMs) have emerged as powerful tools for code understanding. However, deploying these PLMs in large-scale applications faces practical challenges due to their computational intensity and inference latency.…

Software Engineering · Computer Science 2025-08-22 Ruiqi Wang, Zezhou Yang, Cuiyun Gao, Xin Xia, Qing Liao

Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, Yinglong Ma

Knowledge Distillation Beyond Model Compression

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the…

Sound · Computer Science 2024-06-28 Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

Learning from a Lightweight Teacher for Efficient Knowledge Distillation

Knowledge Distillation (KD) is an effective framework for compressing deep learning models, realized by a student-teacher paradigm requiring small student networks to mimic the soft target generated by well-trained teachers. However, the…

Computer Vision and Pattern Recognition · Computer Science 2020-05-20 Yuang Liu, Wei Zhang, Jun Wang

Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition

Multi-teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-02-27 Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, Yongjun Xu

Revisiting Knowledge Distillation for Autoregressive Language Models

Knowledge distillation (KD) is a common approach to compress a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models (LMs), we…

Computation and Language · Computer Science 2024-06-18 Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao

Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification

Knowledge Distillation (KD) is a widespread technique for compressing the knowledge of large models into more compact and efficient models. KD has proved to be highly effective in building well-performing low-complexity Acoustic Scene…

Sound · Computer Science 2025-03-17 Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer

The Role of Teacher Calibration in Knowledge Distillation

Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success,…

Machine Learning · Computer Science 2025-08-29 Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak

Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected…

Computation and Language · Computer Science 2021-09-24 Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

Confidence-Aware Multi-Teacher Knowledge Distillation

Knowledge distillation is initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources…

Machine Learning · Computer Science 2022-02-15 Hailin Zhang, Defang Chen, Can Wang

Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the…

Computation and Language · Computer Science 2025-08-01 Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs…

Computation and Language · Computer Science 2024-12-19 Tianyu Peng, Jiajun Zhang