Related papers: MMSpeech: Multi-modal Multi-task Encoder-Decoder P…

Towards Unsupervised Speech Recognition at the Syllable-Level

Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning…

Computation and Language · Computer Science · 2025-10-07 · Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass

Multi-Modal Pre-Training for Automated Speech Recognition

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-09-19 · David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for…

Sound · Computer Science · 2022-11-28 · Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge

This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge:…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-05-17 · Tanel Alumäe, Kunnar Kukk

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Recently, there has been an increasing interest in neural speech synthesis. While the deep neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to generate a more emotional and more expressive speech is…

Computation and Language · Computer Science · 2021-06-24 · Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this…

Sound · Computer Science · 2024-08-27 · Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to…

Computation and Language · Computer Science · 2022-06-22 · Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei

L1-aware Multilingual Mispronunciation Detection Framework

The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serve as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware…

Computation and Language · Computer Science · 2023-09-22 · Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

Unified Speech-Text Pre-training for Speech Translation and Recognition

We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality…

Computation and Language · Computer Science · 2022-04-13 · Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

Unsupervised pre-training for sequence to sequence speech recognition

This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts respectively. Our pre-training method is divided into two stages, named acoustic pre-training and…

Sound · Computer Science · 2020-01-03 · Zhiyun Fan, Shiyu Zhou, Bo Xu

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin scenarios, the Chinese characters represent meaning but are not directly related to the pronunciation. Thus only considering the writing of…

Computation and Language · Computer Science · 2022-10-19 · Yuting Yang, Binbin Du, Yuke Li

META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-09-20 · Jinhan Wang, Weiqing Wang, Kunal Dhawan, Taejin Park, Myungjong Kim, Ivan Medennikov, He Huang, Nithin Koluguri, Jagadeesh Balam, Boris Ginsburg

Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning

In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU…

Computation and Language · Computer Science · 2021-09-02 · Qian Chen, Wen Wang, Qinglin Zhang

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

In recent research, slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Unmatching convergence rates and…

Computation and Language · Computer Science · 2024-03-12 · Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee

Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer

Transformers have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual…

Computation and Language · Computer Science · 2021-09-13 · Krishna D N

mSLAM: Massively multilingual joint pre-training for speech and text

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines…

Computation and Language · Computer Science · 2022-02-04 · Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau

Transformer-Transducers for Code-Switched Speech Recognition

We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are being…

Computation and Language · Computer Science · 2021-02-16 · Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

Study of Semi-supervised Approaches to Improving English-Mandarin Code-Switching Speech Recognition

In this paper, we present our overall efforts to improve the performance of a code-switching speech recognition system using semi-supervised training methods from lexicon learning to acoustic modeling, on the South East Asian…

Computation and Language · Computer Science · 2018-06-19 · Pengcheng Guo, Haihua Xu, Lei Xie, Eng Siong Chng

Using heterogeneity in semi-supervised transcription hypotheses to improve code-switched speech recognition

Modeling code-switched speech is an important problem in automatic speech recognition (ASR). Labeled code-switched data are rare, so monolingual data are often used to model code-switched speech. These monolingual data may be more closely…

Computation and Language · Computer Science · 2021-06-16 · Andrew Slottje, Shannon Wotherspoon, William Hartmann, Matthew Snover, Owen Kimball

Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints

We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally"…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-09-19 · Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee