Phoneme Segmentation Using Self-Supervised Speech Models

Audio and Speech Processing; Computation and Language; Sound. 2022-11-04, v1

Abstract

We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora, we train and test the model in both the supervised and unsupervised settings. In the unsupervised setting, the model is trained on a noisy label set formed from the predictions of a separate model that was itself trained without supervision. Results indicate that our model surpasses previous state-of-the-art performance in both settings and on both datasets. Finally, after reviewing published code and attempting to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely used evaluation metrics. We resolve this ambiguity by delineating two distinct evaluation schemes and describing their nuances.
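To illustrate how boundary-detection metrics can diverge depending on implementation, the following minimal sketch scores predicted boundaries against reference boundaries under two hit-counting conventions. It is an assumption-laden illustration, not the evaluation code of the paper: the function names, the 20 ms tolerance, the toy boundary times, and the two conventions shown (independent counting versus one-to-one matching) are all hypothetical choices and are not claimed to be the two schemes the authors delineate. The precision, recall, F1, and R-value formulas are the ones commonly used in the segmentation literature.

```python
from math import sqrt

def count_hits_independent(ref, hyp, tol):
    """Count precision and recall hits separately.

    A predicted boundary is a precision hit if ANY reference lies within
    tol; a reference is a recall hit if ANY prediction lies within tol.
    The same boundary may therefore be reused for several hits.
    """
    prec_hits = sum(any(abs(h - r) <= tol for r in ref) for h in hyp)
    rec_hits = sum(any(abs(r - h) <= tol for h in hyp) for r in ref)
    return prec_hits, rec_hits

def count_hits_one_to_one(ref, hyp, tol):
    """Greedy left-to-right matching; each boundary is used at most once."""
    ref, hyp = sorted(ref), sorted(hyp)
    i = j = hits = 0
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            j += 1  # prediction too early to match this or any later reference
        else:
            i += 1  # reference too early to match this or any later prediction
    return hits, hits

def scores(prec_hits, rec_hits, n_ref, n_hyp):
    """Return (precision, recall, F1, R-value) from hit counts."""
    precision = prec_hits / n_hyp if n_hyp else 0.0
    recall = rec_hits / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    over_seg = recall / precision - 1 if precision else 0.0
    r1 = sqrt((1 - recall) ** 2 + over_seg ** 2)
    r2 = (-over_seg + recall - 1) / sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return precision, recall, f1, r_value

# Toy boundary times in integer milliseconds (hypothetical data).
ref = [100, 200, 300]
hyp = [110, 115, 310]
tol = 20

independent = scores(*count_hits_independent(ref, hyp, tol), len(ref), len(hyp))
one_to_one = scores(*count_hits_one_to_one(ref, hyp, tol), len(ref), len(hyp))
```

On this toy input the two predictions near 100 ms both count as precision hits under independent counting, while one-to-one matching credits only one of them, so the same predictions receive different precision values depending on the convention.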

Cite

@article{arxiv.2211.01461,
  title  = {Phoneme Segmentation Using Self-Supervised Speech Models},
  author = {Luke Strgar and David Harwath},
  journal= {arXiv preprint arXiv:2211.01461},
  year   = {2022}
}

Comments

Accepted to SLT 2022
