Related papers: Learning Speech Representations with Variational P…

200 papers

Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-22 Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh

Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations…

Computation and Language · Computer Science 2025-08-12 Robin Huo, Ewan Dunbar

Training objectives based on predictive coding have recently been shown to be very effective at learning meaningful representations from unlabeled speech. One example is Autoregressive Predictive Coding (Chung et al., 2019), which trains an…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-14 Yu-An Chung, James Glass

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase,…

Computation and Language · Computer Science 2021-06-15 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Recent developments in pre-trained speech representation utilizing self-supervised learning (SSL) have yielded exceptional results on a variety of downstream tasks. One such technique, known as masked predictive coding (MPC), has been…

Sound · Computer Science 2024-01-12 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the…

Machine Learning · Computer Science 2021-04-14 Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, Ruslan Salakhutdinov

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to…

Sound · Computer Science 2023-07-06 Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

Learning meaningful and general representations from unannotated speech that are applicable to a wide range of tasks remains challenging. In this paper we propose to use autoregressive predictive coding (APC), a recently proposed…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-28 Yu-An Chung, James Glass

Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-28 Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff

Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre-train a Transformer encoder to learn deep representations from audio data, with objectives predicting either elements from latent vector quantized space or…

Sound · Computer Science 2022-04-08 Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei

While several self-supervised approaches for learning discrete speech representation have been proposed, it is unclear how these seemingly similar approaches relate to each other. In this paper, we consider a generative model with discrete…

Computation and Language · Computer Science 2022-11-01 Sung-Lin Yeh, Hao Tang

Speech modeling methods learn one embedding for a fixed segment of speech, typically between 10 and 25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other)…

Computation and Language · Computer Science 2025-03-04 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-24 Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han, Wei Zou, Xiangang Li

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR).…

Computation and Language · Computer Science 2025-02-19 Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

Self-supervised pre-trained speech models have been shown to be effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the…

Audio and Speech Processing · Electrical Eng. & Systems 2023-11-09 Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng

Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed…

Computation and Language · Computer Science 2024-05-15 Valentin Vielzeuf

Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-06 Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level…

Computation and Language · Computer Science 2021-09-13 Vladimir Araujo, Andrés Villa, Marcelo Mendoza, Marie-Francine Moens, Alvaro Soto

Voice conversion is the task of transforming the voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in…

Sound · Computer Science 2024-05-02 Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech…

Sound · Computer Science 2024-01-31 Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun