Self-supervised Learning with Speech Modulation Dropout

Audio and Speech Processing 2023-03-24 v1 Sound

Abstract

We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to teach machines to extract speech modulations from time-domain context. We further show that, once the network is trained on large volumes of unlabelled data, the outputs of its self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an Automatic Speech Recognition system and greatly reduce its reliance on labeled speech data.
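To make the idea of deleting 2-8 Hz modulations over a 1.5-second section concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say which feature representation or filtering method is used. The sketch assumes a (time, channels) feature array sampled at 100 frames per second and removes the band by zeroing FFT bins along the time axis. The function name `modulation_dropout` and all of its parameters are hypothetical.

```python
import numpy as np

def modulation_dropout(features, frame_rate=100.0, start_s=0.0,
                       dur_s=1.5, band_hz=(2.0, 8.0)):
    """Zero the band_hz modulation components inside one time segment.

    features: array of shape (time, channels), e.g. log sub-band energies
    (an assumed representation). frame_rate is in frames per second.
    """
    out = np.array(features, dtype=float, copy=True)
    a = int(round(start_s * frame_rate))
    b = min(a + int(round(dur_s * frame_rate)), out.shape[0])
    seg = out[a:b]

    # Modulation spectrum of each channel's trajectory within the segment.
    spec = np.fft.rfft(seg, axis=0)
    freqs = np.fft.rfftfreq(seg.shape[0], d=1.0 / frame_rate)

    # Delete the modulation band, then return to the time domain.
    mask = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    spec[mask] = 0.0
    out[a:b] = np.fft.irfft(spec, n=seg.shape[0], axis=0)
    return out

# Toy trajectory: a 4 Hz modulation plus a 20 Hz modulation, 3 s long.
t = np.arange(300) / 100.0
clean = (np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 20 * t))[:, None]
corrupted = modulation_dropout(clean, start_s=0.0)
```

In a self-supervised setup of the kind the abstract describes, a network would receive something like `corrupted` as input and be trained to predict the deleted content from the surrounding context. Here the first 1.5 seconds of `corrupted` retain only the 20 Hz component, while the remainder is unchanged.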

Cite

@article{arxiv.2303.12908,
  title  = {Self-supervised Learning with Speech Modulation Dropout},
  author = {Samik Sadhu and Hynek Hermansky},
  journal= {arXiv preprint arXiv:2303.12908},
  year   = {2023}
}