Related papers: Optimizing Speech Multi-View Feature Fusion throug…

Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection

Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral based features,…

Sound · Computer Science 2025-07-29 Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller

Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models

Self-supervised learning (SSL) has achieved great success in various areas including speech processing. Recently, it is proven that speech based SSL models are able to extract superior universal representations on a range of downstream…

Sound · Computer Science 2022-12-21 Changli Tang, Yujin Wang, Xie Chen, Wei-Qiang Zhang

Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation

Self-Supervised Learning (SSL) models have been successfully applied in various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends highly on the…

Computation and Language · Computer Science 2022-04-20 Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel Lopez-Francisco, Jonathan D. Amith, Shinji Watanabe

Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection

Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-27 Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

Adaptive Federated Fine-Tuning of Self-Supervised Speech Representations

Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity,…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-26 Xin Guo, Chunrui Zhao, Hong Jia, Ting Dang, Gongping Huang, Xianrui Zheng, Yan Gao

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios

Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models could achieve superior…

Sound · Computer Science 2024-06-07 Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe

Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion

This study investigates fine-tuning self-supervised learning (SSL) models using multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework simultaneously handles four related tasks: emotion recognition, gender…

Sound · Computer Science 2025-08-26 Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. On the other hand,…

Sound · Computer Science 2024-11-28 Shih-heng Wang, Jiatong Shi, Chien-yu Huang, Shinji Watanabe, Hung-yi Lee

Rethinking Speech Representation Aggregation in Speech Enhancement: A Phonetic Mutual Information Perspective

Recent speech enhancement (SE) models increasingly leverage self-supervised learning (SSL) representations for their rich semantic information. Typically, intermediate features are aggregated into a single representation via a lightweight…

Sound · Computer Science 2026-02-02 Seungu Han, Sungho Lee, Kyogu Lee

Exploring Federated Self-Supervised Learning for General Purpose Audio Understanding

The integration of Federated Learning (FL) and Self-supervised Learning (SSL) offers a unique and synergetic combination to exploit the audio data for general-purpose audio understanding, without compromising user data privacy. However,…

Sound · Computer Science 2024-02-07 Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Lan Ma, Jiajun Shen

Comparison of Speech Representations for the MOS Prediction System

Automatic methods to predict Mean Opinion Score (MOS) of listeners have been researched to assure the quality of Text-to-Speech systems. Many previous studies focus on architectural advances (e.g. MBNet, LDNet, etc.) to capture relations…

Sound · Computer Science 2022-06-29 Aki Kunikoshi, Jaebok Kim, Wonsuk Jun, Kåre Sjölander

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is…

Computation and Language · Computer Science 2026-03-04 Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang, Kaiyuan Liu, Bo Yang, Yang Xiang, Ming Liu, Bing Qin

Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-08 Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara

Fusion Self-supervised Learning for Recommendation

Recommender systems are widely deployed in various web environments, and self-supervised learning (SSL) has recently attracted significant attention in this field. Contrastive learning (CL) stands out as a major SSL paradigm due to its…

Information Retrieval · Computer Science 2025-01-17 Yu Zhang, Lei Sang, Yi Zhang, Yiwen Zhang, Yun Yang

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

Using self-supervised learning (SSL) models has significantly improved performance for downstream speech tasks, surpassing the capabilities of traditional hand-crafted features. This study investigates the amalgamation of SSL models, with…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-27 Szu-Jui Chen, John H. L. Hansen

Exploring Efficient-tuning Methods in Self-supervised Speech Models

In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-31 Zih-Ching Chen, Chin-Lun Fu, Chih-Ying Liu, Shang-Wen Li, Hung-yi Lee

Federated Self-supervised Speech Representations: Are We There Yet?

The ubiquity of microphone-enabled devices has led to large amounts of unlabelled audio data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially…

Sound · Computer Science 2022-07-21 Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Abhinav Mehrotra, Nicholas D. Lane

Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-20 Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman

SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Designing a speech quality assessment (SQA) system for estimating mean-opinion-score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises due to the limited availability of a…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-17 Fengyuan Cao, Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee

Interface Design for Self-Supervised Speech Models

Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to…

Sound · Computer Science 2024-06-19 Yi-Jen Shih, David Harwath