Related papers: Joint Encoder-Decoder Self-Supervised Pre-training…

An Adapter based Multi-label Pre-training for Speech Separation and Enhancement

In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR),…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-14 Tianrui Wang, Xie Chen, Zhuo Chen, Shu Yu, Weibin Zhu

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some…

Computation and Language · Computer Science 2024-04-22 Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech…

Sound · Computer Science 2024-01-31 Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training

Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years witness great successes in applying self-supervised learning in…

Computation and Language · Computer Science 2021-10-13 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. The SSL model is normally pre-trained on a great variety of unlabelled data and a large model size is preferred to increase the modeling…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-08 Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang

Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

This paper proposes a novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band). The joint…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-06 Vrunda N. Sukhadia, A. Arunkumar, S. Umesh

Exploration on HuBERT with Multiple Resolutions

Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20ms resolution for hidden representations would not be optimal for various speech-processing tasks since…

Sound · Computer Science 2023-06-26 Jiatong Shi, Yun Tang, Hirofumi Inaguma, Hongyu Gong, Juan Pino, Shinji Watanabe

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued…

Sound · Computer Science 2023-05-30 Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

Self-supervised learning (SSL) has led to great strides in speech processing. However, the resources needed to train these models have become prohibitively large as they continue to scale. Currently, only a few groups with substantial…

Computation and Language · Computer Science 2023-06-13 William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe

Large Language Model Guided Decoding for Self-Supervised Speech Recognition

Self-supervised automatic speech recognition (SSL-ASR) is an ASR approach that uses speech encoders pretrained on large amounts of unlabeled audio (e.g., wav2vec2.0 or HuBERT) and then fine-tunes them with limited labeled data to perform…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-07 Eyal Cohen, Bhiksha Raj, Joseph Keshet

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase,…

Computation and Language · Computer Science 2021-06-15 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to…

Computation and Language · Computer Science 2022-06-22 Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei

Deploying self-supervised learning in the wild for hybrid automatic speech recognition

Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR). These great improvements have been reported mostly based on highly curated datasets such as LibriSpeech for non-streaming…

Sound · Computer Science 2022-05-19 Mostafa Karimi, Changliang Liu, Kenichi Kumatani, Yao Qian, Tianyu Wu, Jian Wu

Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition

Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these…

Computation and Language · Computer Science 2023-02-21 A Arunkumar, Vrunda N Sukhadia, S. Umesh

Progressive Multi-Scale Self-Supervised Learning for Speech Recognition

Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In addition, ASR performance could be further improved if the model is dedicated to audio content information learning…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-08 Genshun Wan, Tan Liu, Hang Chen, Jia Pan, Cong Liu, Zhongfu Ye

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation

The excellent generalization ability of self-supervised learning (SSL) for speech foundation models has garnered significant attention. HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete…

Computation and Language · Computer Science 2023-06-16 Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen

Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy

Pre-trained models, especially self-supervised learning (SSL) models, have demonstrated impressive results in automatic speech recognition (ASR) tasks. While most applications of SSL models focus on leveraging continuous representations as…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-03 Zehan Li, Yan Yang, Xueqing Li, Jian Kang, Xiao-Lei Zhang, Jie Li

Self-Supervised Learning for speech recognition with Intermediate layer supervision

Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-17 Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the…

Sound · Computer Science 2022-06-22 Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR),…

Audio and Speech Processing · Electrical Eng. & Systems 2025-03-25 Yifan Yang, Jianheng Zhuo, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen