Related papers: Multi-Representation Knowledge Distillation For Au…

Audio-Visual Model Distillation Using Acoustic Images

In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Former models learn audio representations from raw signals or…

Computer Vision and Pattern Recognition · Computer Science · 2020-02-12 · Andrés F. Pérez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino

Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation

While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-09-24 · Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Audio Embeddings as Teachers for Music Classification

Music classification has been one of the most popular tasks in the field of music information retrieval. With the development of deep learning models, the last decade has seen impressive improvements in a wide range of classification tasks.…

Sound · Computer Science · 2023-07-03 · Yiwei Ding, Alexander Lerch

Knowledge Distillation for Real-Time Classification of Early Media in Voice Communications

This paper investigates the industrial setting of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight some…

Sound · Computer Science · 2025-07-28 · Kemal Altwlkany, Hadžem Hadžić, Amar Kurić, Emanuel Lacic

Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-02-11 · Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling

Audio Representation Learning by Distilling Video as Privileged Information

Deep audio representation learning using multi-modal audio-visual data often leads to a better performance compared to uni-modal approaches. However, in real-world scenarios both modalities are not always available at the time of inference,…

Sound · Computer Science · 2023-02-07 · Amirhossein Hajavi, Ali Etemad

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-01-11 · Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

Knowledge Distillation for Efficient Audio-Visual Video Captioning

Automatically describing audio-visual content with texts, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-19 · Özkan Çaylı, Xubo Liu, Volkan Kılıç, Wenwu Wang

Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition

Knowledge distillation has been widely used to compress existing deep learning models while preserving the performance on a wide range of applications. In the specific context of Automatic Speech Recognition (ASR), distillation from…

Machine Learning · Computer Science · 2021-07-06 · Yan Gao, Titouan Parcollet, Nicholas Lane

Representation Consolidation for Training Expert Students

Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher. A more useful goal than emulation, yet under-explored, is for the student to learn feature representations that…

Computer Vision and Pattern Recognition · Computer Science · 2021-07-19 · Zhizhong Li, Avinash Ravichandran, Charless Fowlkes, Marzia Polito, Rahul Bhotika, Stefano Soatto

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-23 · Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach

The remarkable performance of the pre-trained language model (LM) using self-supervised learning has led to a major paradigm shift in the study of natural language processing. In line with these changes, leveraging the performance of speech…

Machine Learning · Computer Science · 2021-10-22 · Mun-Hak Lee, Joon-Hyuk Chang

Class-Incremental Learning for Multi-Label Audio Classification

In this paper, we propose a method for class-incremental learning of potentially overlapping sounds for solving a sequence of multi-label audio classification tasks. We design an incremental learner that learns new classes independently of…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-10 · Manjunath Mulimani, Annamaria Mesaros

Multilingual Representation Distillation with Contrastive Learning

Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive…

Computation and Language · Computer Science · 2023-05-02 · Weiting Tan, Kevin Heffernan, Holger Schwenk, Philipp Koehn

Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment

This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and…

Computation and Language · Computer Science · 2025-07-22 · Xiandong Meng, Yan Wu, Yexin Tian, Xin Hu, Tianze Kang, Junliang Du

Temporal Knowledge Distillation for On-device Audio Classification

Improving the performance of on-device audio classification models remains a challenge given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring…

Sound · Computer Science · 2022-02-08 · Kwanghee Choi, Martin Kersner, Jacob Morton, Buru Chang

ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation

Multimodal Federated Learning frequently encounters challenges of client modality heterogeneity, leading to undesired performances for secondary modality in multimodal learning. It is particularly prevalent in audiovisual learning, with…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-08-29 · Tiantian Feng, Tuo Zhang, Salman Avestimehr, Shrikanth S. Narayanan

Multi-Distillation from Speech and Music Representation Models

Real-world audio often mixes speech and music, yet models typically handle only one domain. This paper introduces a multi-teacher distillation framework that unifies speech and music models into a single one while significantly reducing…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-12 · Jui-Chiang Wei, Yi-Cheng Lin, Fabian Ritter-Gutierrez, Hung-yi Lee

Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on…

Sound · Computer Science · 2019-11-15 · Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous

Knowledge Distillation Leveraging Alternative Soft Targets from Non-Parallel Qualified Speech Data

This paper describes a novel knowledge distillation framework that leverages acoustically qualified speech data included in an existing training data pool as privileged information. In our proposed framework, a student network is trained…

Sound · Computer Science · 2021-12-17 · Tohru Nagano, Takashi Fukuda, Gakuto Kurata