Related papers: Phoneme Segmentation Using Self-Supervised Speech …

Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-07 Felix Kreuk, Joseph Keshet, Yossi Adi

Phoneme Boundary Detection using Learnable Segmental Features

Phoneme boundary detection plays an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-18 Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi

Back to Supervision: Boosting Word Boundary Detection through Frame Classification

Speech segmentation at both word and phoneme levels is crucial for various speech processing tasks. It significantly aids in extracting meaningful units from an utterance, thus enabling the generation of discrete elements. In this work we…

Machine Learning · Computer Science 2024-11-18 Simone Carnemolla, Salvatore Calcagno, Simone Palazzo, Daniela Giordano

Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching

We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping…

Audio and Speech Processing · Electrical Eng. & Systems 2018-12-27 Chih-Kuan Yeh, Jianshu Chen, Chengzhu Yu, Dong Yu

Semi-supervised Learning with Sparse Autoencoders in Phone Classification

We propose the application of a semi-supervised learning method to improve the performance of acoustic modelling for automatic speech recognition based on deep neural networks. As opposed to unsupervised initialisation followed by…

Machine Learning · Statistics 2016-10-04 Akash Kumar Dhaka, Giampiero Salvi

Supervised Acoustic Embeddings And Their Transferability Across Languages

In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise, which is challenging in low-resource settings. Self-supervised pre-training…

Computation and Language · Computer Science 2023-01-04 Sreepratha Ram, Hanan Aldarmaki

Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-12 Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Self-supervised audio representation learning for mobile devices

We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method…

Audio and Speech Processing · Electrical Eng. & Systems 2019-05-29 Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, Dominik Roblek

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-28 Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds

Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic…

Machine Learning · Computer Science 2025-12-10 Jules Cauzinille, Marius Miron, Olivier Pietquin, Masato Hagiwara, Ricard Marxer, Arnaud Rey, Benoit Favre

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard…

Computation and Language · Computer Science 2025-06-13 Zébulon Goriely, Paula Buttery

What Do Self-Supervised Speech Models Know About Words?

Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is…

Computation and Language · Computer Science 2024-02-01 Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

Unsupervised Word Segmentation Using Temporal Gradient Pseudo-Labels

Unsupervised word segmentation in audio utterances is challenging as, in speech, there is typically no gap between words. In a preliminary experiment, we show that recent deep self-supervised features are very effective for word…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-04 Tzeviya Sylvia Fuchs, Yedid Hoshen

Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training

Speech separation has been well developed, with the very successful permutation invariant training (PIT) approach, although the frequent label assignment switching happening during PIT training remains to be a problem when better…

Sound · Computer Science 2021-08-24 Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee

Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and…

Computation and Language · Computer Science 2021-04-21 Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for…

Sound · Computer Science 2022-11-28 Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

Blind phoneme segmentation with temporal prediction errors

Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural network. Our approach consists in…

Computation and Language · Computer Science 2017-05-30 Paul Michel, Okko Räsänen, Roland Thiollière, Emmanuel Dupoux

TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer

In this paper, we present a novel approach for text independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-06 Noé Tits, Prernna Bhatnagar, Thierry Dutoit

Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-07 Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach

Recovering the masked speech frames is widely applied in speech representation learning. However, most of these models use random masking in the pre-training. In this work, we proposed two kinds of masking approaches: (1) speech-level…

Sound · Computer Science 2022-10-26 Xulong Zhang, Jianzong Wang, Ning Cheng, Kexin Zhu, Jing Xiao