Speech Emotion Recognition using Self-Supervised Features

Sound 2022-02-09 v1 Artificial Intelligence Machine Learning Audio and Speech Processing

Abstract

Self-supervised pre-trained features have consistently delivered state-of-the-art results in natural language processing (NLP); however, their merits in speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy integration of a large variety of self-supervised features. We perform several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal, speech-only system not only achieves state-of-the-art (SOTA) results, but also sheds light on the possibility that powerful, well fine-tuned self-supervised acoustic features can reach results similar to those achieved by SOTA multimodal systems using both speech and text modalities.
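
The Upstream + Downstream flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `upstream` function is a toy stand-in for a self-supervised feature model, mean pooling is only one possible way to aggregate frame-level features into an utterance-level feature, the back-end is a single randomly initialized linear layer with softmax, and the four labels are a hypothetical class set.

```python
import math
import random

# Hypothetical label set, assumed for illustration only.
LABELS = ["angry", "happy", "neutral", "sad"]


def upstream(waveform, frame_size=4):
    """Toy stand-in for a self-supervised upstream model.

    Splits the waveform into non-overlapping frames and returns one small
    feature vector (mean, max, min) per frame.
    """
    frames = [
        waveform[i:i + frame_size]
        for i in range(0, len(waveform) - frame_size + 1, frame_size)
    ]
    return [[sum(f) / len(f), max(f), min(f)] for f in frames]


def mean_pool(frame_feats):
    """Aggregate frame-level features into one utterance-level vector."""
    n = len(frame_feats)
    return [sum(column) / n for column in zip(*frame_feats)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def downstream(utt_feat, weights, bias):
    """Toy back-end classifier: one linear layer followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, utt_feat)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)
waveform = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # fake audio samples

frame_feats = upstream(waveform)   # frame-level features
utt_feat = mean_pool(frame_feats)  # utterance-level feature

# Untrained, randomly initialized classifier weights (illustration only).
weights = [[rng.uniform(-1.0, 1.0) for _ in utt_feat] for _ in LABELS]
bias = [0.0] * len(LABELS)

probs = downstream(utt_feat, weights, bias)
predicted = LABELS[probs.index(max(probs))]
print(predicted, probs)
```

Because the three stages only communicate through plain feature vectors, any one of them could be swapped out (a different upstream model, another pooling scheme, a deeper classifier) without touching the others, which is the kind of modularity the abstract refers to.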

Cite

@article{arxiv.2202.03896,
  title  = {Speech Emotion Recognition using Self-Supervised Features},
  author = {Edmilson Morais and Ron Hoory and Weizhong Zhu and Itai Gat and Matheus Damasceno and Hagai Aronowitz},
  journal= {arXiv preprint arXiv:2202.03896},
  year   = {2022}
}

Comments

5 pages, 4 figures, 2 tables, ICASSP 2022
