Related papers: DQ-Data2vec: Decoupling Quantization for Multiling…

data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup

In this paper, we propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc, for speech representation learning from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-16 Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-25 Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised…

Machine Learning · Computer Science 2022-10-27 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes…

Machine Learning · Computer Science 2023-06-16 Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider…

Computation and Language · Computer Science 2022-01-27 Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

Unsupervised Cross-lingual Representation Learning for Speech Recognition

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over…

Computation and Language · Computer Science 2020-12-17 Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains…

Sound · Computer Science 2026-03-19 Jinyang Wu, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the…

Sound · Computer Science 2023-10-05 Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

Distilling a Pretrained Language Model to a Multilingual ASR Model

Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are…

Computation and Language · Computer Science 2022-06-28 Kwanghee Choi, Hyung-Min Park

Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition

State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, the performance on impaired speech still remains an issue. The current study explores the usefulness of using Wav2Vec self-supervised…

Computation and Language · Computer Science 2022-04-05 Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy

Pre-trained models, especially self-supervised learning (SSL) models, have demonstrated impressive results in automatic speech recognition (ASR) task. While most applications of SSL models focus on leveraging continuous representations as…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-03 Zehan Li, Yan Yang, Xueqing Li, Jian Kang, Xiao-Lei Zhang, Jie Li

MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization

Self-supervised learning (SSL) has shown significant progress in speech processing tasks. However, despite the intrinsic randomness in the Transformer structure, such as dropout variants and layer-drop, improving the model-level consistency…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-16 Ji Won Yoon, Seok Min Kim, Nam Soo Kim

DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition model. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to…

Audio and Speech Processing · Electrical Eng. & Systems 2020-12-15 Shaoshi Ling, Yuzong Liu

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-11 Xianrui Zheng, Chao Zhang, Philip C. Woodland

A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors

In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as…

Computation and Language · Computer Science 2023-11-28 Shuyue Stella Li, Beining Xu, Xiangyu Zhang, Hexin Liu, Wenhan Chao, Leibny Paola Garcia

Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-07 Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu

Unsupervised Speech Recognition

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U,…

Computation and Language · Computer Science 2022-05-04 Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Self-supervised Learning with Random-projection Quantizer for Speech Recognition

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In…

Computation and Language · Computer Science 2022-07-01 Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-13 Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen

Phonetic-attention scoring for deep speaker features in speaker verification

Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network with the training target set to discriminate speakers by a short speech segment. By pooling the frame-level features, utterance-level…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-09 Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang