Related papers: DQ-Data2vec: Decoupling Quantization for Multiling…

200 papers

In this paper, we propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc, for speech representation learning from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-05-16 · Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-06-25 · Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised…

Machine Learning · Computer Science · 2022-10-27 · Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes…

Machine Learning · Computer Science · 2023-06-16 · Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider…

Computation and Language · Computer Science · 2022-01-27 · Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over…

Computation and Language · Computer Science · 2020-12-17 · Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains…

Sound · Computer Science · 2026-03-19 · Jinyang Wu, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the…

Sound · Computer Science · 2023-10-05 · Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are…

Computation and Language · Computer Science · 2022-06-28 · Kwanghee Choi, Hyung-Min Park

State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, the performance on impaired speech still remains an issue. The current study explores the usefulness of using Wav2Vec self-supervised…

Computation and Language · Computer Science · 2022-04-05 · Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Pre-trained models, especially self-supervised learning (SSL) models, have demonstrated impressive results in automatic speech recognition (ASR) task. While most applications of SSL models focus on leveraging continuous representations as…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-09-03 · Zehan Li, Yan Yang, Xueqing Li, Jian Kang, Xiao-Lei Zhang, Jie Li

Self-supervised learning (SSL) has shown significant progress in speech processing tasks. However, despite the intrinsic randomness in the Transformer structure, such as dropout variants and layer-drop, improving the model-level consistency…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-16 · Ji Won Yoon, Seok Min Kim, Nam Soo Kim

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition model. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-12-15 · Shaoshi Ling, Yuzong Liu

Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-07-11 · Xianrui Zheng, Chao Zhang, Philip C. Woodland

In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as…

Computation and Language · Computer Science · 2023-11-28 · Shuyue Stella Li, Beining Xu, Xiangyu Zhang, Hexin Liu, Wenhan Chao, Leibny Paola Garcia

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-12-07 · Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U,…

Computation and Language · Computer Science · 2022-05-04 · Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In…

Computation and Language · Computer Science · 2022-07-01 · Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-07-13 · Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen

Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network with the training target set to discriminate speakers by a short speech segment. By pooling the frame-level features, utterance-level…

Audio and Speech Processing · Electrical Eng. & Systems · 2018-11-09 · Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang