Related papers: MMSpeech: Multi-modal Multi-task Encoder-Decoder P…

200 papers

Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning…

Computation and Language · Computer Science 2025-10-07 Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-19 David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for…

Sound · Computer Science 2022-11-28 Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge:…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-17 Tanel Alumäe, Kunnar Kukk

Recently, there has been an increasing interest in neural speech synthesis. While the deep neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to generate a more emotional and more expressive speech is…

Computation and Language · Computer Science 2021-06-24 Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this…

Sound · Computer Science 2024-08-27 Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to…

Computation and Language · Computer Science 2022-06-22 Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei

The phonological discrepancies between a speaker's native (L1) and non-native (L2) languages serve as a major factor in mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware…

Computation and Language · Computer Science 2023-09-22 Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali

We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality…

Computation and Language · Computer Science 2022-04-13 Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts, respectively. Our pre-training method is divided into two stages, named acoustic pre-training and…

Sound · Computer Science 2020-01-03 Zhiyun Fan, Shiyu Zhou, Bo Xu

The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation. Thus only considering the writing of…

Computation and Language · Computer Science 2022-10-19 Yuting Yang, Binbin Du, Yuke Li

We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-20 Jinhan Wang, Weiqing Wang, Kunal Dhawan, Taejin Park, Myungjong Kim, Ivan Medennikov, He Huang, Nithin Koluguri, Jagadeesh Balam, Boris Ginsburg

In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU…

Computation and Language · Computer Science 2021-09-02 Qian Chen, Wen Wang, Qinglin Zhang

In recent research, only a slight performance improvement has been observed when moving from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Unmatching convergence rates and…

Computation and Language · Computer Science 2024-03-12 Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee

Transformers have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual…

Computation and Language · Computer Science 2021-09-13 Krishna D N

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines…

Computation and Language · Computer Science 2022-02-04 Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau

We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are being…

Computation and Language · Computer Science 2021-02-16 Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

In this paper, we present our overall efforts to improve the performance of a code-switching speech recognition system using semi-supervised training methods from lexicon learning to acoustic modeling, on the South East Asian…

Computation and Language · Computer Science 2018-06-19 Pengcheng Guo, Haihua Xu, Lei Xie, Eng Siong Chng

Modeling code-switched speech is an important problem in automatic speech recognition (ASR). Labeled code-switched data are rare, so monolingual data are often used to model code-switched speech. These monolingual data may be more closely…

Computation and Language · Computer Science 2021-06-16 Andrew Slottje, Shannon Wotherspoon, William Hartmann, Matthew Snover, Owen Kimball

We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally"…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-19 Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee