Related papers: Speech Emotion Recognition using Self-Supervised F…

End-to-End Integration of Speech Emotion Recognition with Voice Activity Detection using Self-Supervised Learning Features

Speech Emotion Recognition (SER) often operates on speech segments detected by a Voice Activity Detection (VAD) model. However, VAD models may output flawed speech segments, especially in noisy environments, resulting in degraded…

Sound · Computer Science · 2024-10-18 · Natsuo Yamashita, Masaaki Yamamoto, Yohei Kawaguchi

Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

Speech Emotion Recognition (SER) aims to help machines understand humans' subjective emotions from audio information alone. However, extracting and utilizing comprehensive, in-depth audio information is still a challenging task. In this…

Sound · Computer Science · 2022-03-30 · Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng

Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-09-25 · Amirali Soltani Tehrani, Niloufar Faridani, Ramin Toosi

Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human…

Sound · Computer Science · 2024-11-07 · Pourya Jafarzadeh, Amir Mohammad Rostami, Padideh Choobdar

Speech Emotion Recognition Using Deep Sparse Auto-Encoder Extreme Learning Machine with a New Weighting Scheme and Spectro-Temporal Features Along with Classical Feature Selection and A New Quantum-Inspired Dimension Reduction Method

Affective computing is very important in the relationship between humans and machines. In this paper, a system for speech emotion recognition (SER) based on the speech signal is proposed, which uses new techniques at different stages of processing.…

Sound · Computer Science · 2021-11-16 · Fatemeh Daneshfar, Seyed Jahanshah Kabudian

Self-Supervised learning with cross-modal transformers for emotion recognition

Emotion recognition is a challenging task due to limited availability of in-the-wild labeled datasets. Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language.…

Computation and Language · Computer Science · 2021-04-08 · Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning

We propose a novel transfer learning method for speech emotion recognition that obtains promising results when only little training data is available. With as few as 125 examples per emotion class, we were able to reach a higher…

Machine Learning · Computer Science · 2020-11-12 · Jonathan Boigne, Biman Liyanage, Ted Östrem

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition, called Integrated speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS).…

Sound · Computer Science · 2022-04-04 · Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features

Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need…

Computation and Language · Computer Science · 2020-11-18 · Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

End-to-end (E2E) systems have played an increasingly important role in automatic speech recognition (ASR) and achieved strong performance. However, E2E systems recognize output word sequences directly from the input acoustic features, which…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-08-04 · Qi Liu, Zhehuai Chen, Hao Li, Mingkun Huang, Yizhou Lu, Kai Yu

Metadata-Enhanced Speech Emotion Recognition: Augmented Residual Integration and Co-Attention in Two-Stage Fine-Tuning

Speech Emotion Recognition (SER) involves analyzing vocal expressions to determine the emotional state of speakers, where the comprehensive and thorough utilization of audio information is paramount. Therefore, we propose a novel approach…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-04-29 · Zixiang Wan, Ziyue Qiu, Yiyang Liu, Wei-Qiang Zhang

End-to-end Acoustic-linguistic Emotion and Intent Recognition Enhanced by Semi-supervised Learning

Emotion and intent recognition from speech is essential and has been widely investigated in human-computer interaction. The rapid development of social media platforms, chatbots, and other technologies has led to a large volume of speech…

Sound · Computer Science · 2025-07-11 · Zhao Ren, Rathi Adarshi Rammohan, Kevin Scheck, Sheng Li, Tanja Schultz

Self-Supervised Learning for Audio-Based Emotion Recognition

Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio…

Sound · Computer Science · 2023-07-25 · Peranut Nimitsurachat, Peter Washington

Bimodal Speech Emotion Recognition Using Pre-Trained Language Models

Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion recognition, achieving an accuracy of 69.5%…

Audio and Speech Processing · Electrical Eng. & Systems · 2019-12-06 · Verena Heusser, Niklas Freymuth, Stefan Constantin, Alex Waibel

Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and…

Human-Computer Interaction · Computer Science · 2023-12-05 · Rutherford Agbeshi Patamia, Paulo E. Santos, Kingsley Nketia Acheampong, Favour Ekong, Kwabena Sarpong, She Kun

Towards Interpretable and Transferable Speech Emotion Recognition: Latent Representation Based Analysis of Features, Methods and Corpora

In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques.…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-05-06 · Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model

Emotion plays a fundamental role in human interaction, and therefore systems capable of identifying emotions in speech are crucial in the context of human-computer interaction. Speech emotion recognition (SER) is a challenging problem,…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-03 · Lucas Ueda, João Lima, Leonardo Marques, Paula Costa

Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL…

Sound · Computer Science · 2025-05-30 · Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Multilingual end-to-end (E2E) models have shown great promise in expanding automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and…

Audio and Speech Processing · Electrical Eng. & Systems · 2019-09-13 · Anjuli Kannan, Arindrima Datta, Tara N. Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee

ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require…

Computation and Language · Computer Science · 2022-10-25 · Sanchit Gandhi, Patrick von Platen, Alexander M. Rush