Related papers: Learning Speech Representations with Variational P…

Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models

Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-22 Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh

Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0

Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations…

Computation and Language · Computer Science 2025-08-12 Robin Huo, Ewan Dunbar

Improved Speech Representations with Multi-Target Autoregressive Predictive Coding

Training objectives based on predictive coding have recently been shown to be very effective at learning meaningful representations from unlabeled speech. One example is Autoregressive Predictive Coding (Chung et al., 2019), which trains an…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-14 Yu-An Chung, James Glass

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase,…

Computation and Language · Computer Science 2021-06-15 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Analysing the Masked predictive coding training criterion for pre-training a Speech Representation Model

Recent developments in pre-trained speech representation utilizing self-supervised learning (SSL) have yielded exceptional results on a variety of downstream tasks. One such technique, known as masked predictive coding (MPC), has been…

Sound · Computer Science 2024-01-12 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

Self-supervised Representation Learning with Relative Predictive Coding

This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the…

Machine Learning · Computer Science 2021-04-14 Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, Ruslan Salakhutdinov

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to…

Sound · Computer Science 2023-07-06 Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

Generative Pre-Training for Speech with Autoregressive Predictive Coding

Learning meaningful and general representations from unannotated speech that are applicable to a wide range of tasks remains challenging. In this paper we propose to use autoregressive predictive coding (APC), a recently proposed…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-28 Yu-An Chung, James Glass

Representation learning through cross-modal conditional teacher-student training for speech emotion recognition

Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-28 Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

Speech Pre-training with Acoustic Piece

Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre-train a Transformer encoder to learn deep representations from audio data, with objectives predicting either elements from latent vector quantized space or…

Sound · Computer Science 2022-04-08 Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei

Autoregressive Co-Training for Learning Discrete Speech Representations

While several self-supervised approaches for learning discrete speech representation have been proposed, it is unclear how these seemingly similar approaches relate to each other. In this paper, we consider a generative model with discrete…

Computation and Language · Computer Science 2022-11-01 Sung-Lin Yeh, Hao Tang

Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation

Speech modeling methods learn one embedding for a fixed segment of speech, typically in between 10-25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other)…

Computation and Language · Computer Science 2025-03-04 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition

Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-24 Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han, Wei Zou, Xiangang Li

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR).…

Computation and Language · Computer Science 2025-02-19 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

Self-supervised pre-trained speech models were shown effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the…

Audio and Speech Processing · Electrical Eng. & Systems 2023-11-09 Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng

Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining

Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed…

Computation and Language · Computer Science 2024-05-15 Valentin Vielzeuf

Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-06 Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level…

Computation and Language · Computer Science 2021-09-13 Vladimir Araujo, Andrés Villa, Marcelo Mendoza, Marie-Francine Moens, Alvaro Soto

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in…

Sound · Computer Science 2024-05-02 Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech…

Sound · Computer Science 2024-01-31 Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun