Related papers: Multi-task Voice Activated Framework using Self-su…

Exploring wav2vec 2.0 on speaker verification and language identification

Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks especially ultra-low…

Sound · Computer Science 2021-01-15 Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu

Multitask Detection of Speaker Changes, Overlapping Speech and Voice Activity Using wav2vec 2.0

Self-supervised learning approaches have lately achieved great success on a broad spectrum of machine learning problems. In the field of speech processing, one of the most successful recent self-supervised models is wav2vec 2.0. In this…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-10 Marie Kunešová, Zbyněk Zajíc

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes…

Machine Learning · Computer Science 2023-06-16 Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have…

Sound · Computer Science 2022-01-10 Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf

Layer-wise Analysis of a Self-supervised Speech Representation Model

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the…

Computation and Language · Computer Science 2022-12-06 Ankita Pasad, Ju-Chieh Chou, Karen Livescu

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Self-supervised speech pre-training methods have developed rapidly in recent years, which show to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-25 Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the…

Computation and Language · Computer Science 2020-10-23 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Towards multi-task learning of speech and speaker recognition

We study multi-task learning for two orthogonal speech technology tasks: speech and speaker recognition. We use wav2vec2 as a base architecture with two task-specific output heads. We experiment with different architectural decisions to mix…

Sound · Computer Science 2023-05-29 Nik Vaessen, David A. van Leeuwen

Improved Language Identification Through Cross-Lingual Self-Supervised Learning

Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech…

Computation and Language · Computer Science 2021-10-19 Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition

Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It was shown that wav2vec2.0 has a good robustness against the domain shift, while the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-10 Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-08 Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen

wav2vec: Unsupervised Pre-training for Speech Recognition

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model…

Computation and Language · Computer Science 2019-09-12 Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised…

Machine Learning · Computer Science 2022-10-27 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Wav2Vec2.0 on the Edge: Performance Evaluation

Wav2Vec2.0 is a state-of-the-art model which learns speech representations through unlabeled speech data, aka, self supervised learning. The pretrained model is then fine tuned on small amounts of labeled data to use it for speech-to-text…

Sound · Computer Science 2022-02-15 Santosh Gondi

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-23 Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

Toward a realistic model of speech processing in the brain with self-supervised learning

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts…

Neurons and Cognition · Quantitative Biology 2023-03-21 Juliette Millet, Charlotte Caucheteux, Pierre Orhan, Yves Boubenec, Alexandre Gramfort, Ewan Dunbar, Christophe Pallier, Jean-Remi King

Interpretable Temporal Class Activation Representation for Audio Spoofing Detection

Explaining the decisions made by audio spoofing detection models is crucial for fostering trust in detection outcomes. However, current research on the interpretability of detection models is limited to applying XAI tools to post-trained…

Sound · Computer Science 2025-07-28 Menglu Li, Xiao-Ping Zhang

Adaptive multilingual speech recognition with pretrained models

Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from…

Computation and Language · Computer Science 2022-05-26 Ngoc-Quan Pham, Alex Waibel, Jan Niehues

Self-Supervised Learning for Multi-Channel Neural Transducer

Self-supervised learning, such as with the wav2vec 2.0 framework significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we…

Computation and Language · Computer Science 2024-08-07 Atsushi Kojima