Self-Supervised Learning-Based Source Separation for Meeting Data

Yuang Li; Xianrui Zheng; Philip C. Woodland

Audio and Speech Processing 2023-04-04 v1

Abstract

Source separation can improve automatic speech recognition (ASR) in multi-party meeting scenarios by extracting single-speaker signals from overlapped speech. Despite the success of self-supervised learning (SSL) models in single-channel source separation, most studies have focused on simulated setups. In this paper, seven SSL models were compared on both simulated and real-world corpora. We then propose integrating the best-performing model, WavLM, into an automatic transcription system through a novel iterative source selection method. To improve real-world performance, time-domain unsupervised mixture invariant training was adapted to the time-frequency domain. Experiments showed that, when source separation was inserted into the transcription system before an ASR model fine-tuned on separated speech, absolute reductions of 1.9% and 1.5% in concatenated minimum-permutation word error rate for an unknown number of speakers (cpWER-us) were observed on the AMI dev and test sets, respectively.
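To make the metric named in the abstract more concrete, the following is a minimal sketch of a generic concatenated minimum-permutation WER (cpWER) computation. It is not the authors' scoring code: the function names `edit_distance` and `cpwer`, the dictionary input format, and the choice to pad the side with fewer speakers using empty streams are all assumptions made for illustration. The exact definition of the cpWER-us variant is given in the paper and is not reproduced here.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Generic cpWER sketch (hypothetical helper, not the paper's scorer).

    Each argument maps a speaker label to that speaker's utterances in
    time order. Utterances are concatenated per speaker, then every
    assignment of hypothesis streams to reference speakers is scored
    and the lowest total error count is kept.
    """
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Assumption: when the speaker counts differ, pad the shorter side
    # with empty streams so that every stream is still scored.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    ref = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
    hyp = {"s1": ["fine thanks"], "s2": ["hello word", "how are you"]}
    print(f"cpWER = {cpwer(ref, hyp):.3f}")
```

The brute-force search over permutations grows factorially with the number of speakers, which is acceptable for a handful of meeting participants; larger cases are usually handled with an assignment solver instead.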

Keywords

automatic speech recognition, speech separation, self-supervised speech learning

Cite

@article{arxiv.2304.00871,
  title  = {Self-Supervised Learning-Based Source Separation for Meeting Data},
  author = {Yuang Li and Xianrui Zheng and Philip C. Woodland},
  journal= {arXiv preprint arXiv:2304.00871},
  year   = {2023}
}

Comments

To appear in Proc. ICASSP 2023
