Related papers: Understanding Shared Speech-Text Representations

200 papers

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science 2019-12-17 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as…

Computation and Language · Computer Science 2023-08-14 Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared…

Computation and Language · Computer Science 2023-10-10 Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while…

Computation and Language · Computer Science 2022-07-05 Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

End-to-end transformer-based automatic speech recognition (ASR) systems often capture multiple speech traits in their learned representations that are highly entangled, leading to a lack of interpretability. In this study, we propose the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-28 Pu Wang, Hugo Van hamme

Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech…

Sound · Computer Science 2023-10-10 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-14 Pavel Denisov, Ngoc Thang Vu

End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and…

Computation and Language · Computer Science 2020-10-29 Yuchen Liu, Junnan Zhu, Jiajun Zhang, Chengqing Zong

In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in…

Computation and Language · Computer Science 2025-03-19 Wuwei Huang, Dexin Wang, Deyi Xiong

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that…

Audio and Speech Processing · Electrical Eng. & Systems 2019-04-04 Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate…

Computation and Language · Computer Science 2023-06-22 Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate…

Computation and Language · Computer Science 2026-04-30 Albert Zeyer, Tim Posielek, Ralf Schlüter, Hermann Ney

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-02 Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual…

Computation and Language · Computer Science 2021-04-20 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-02 Zili Huang, Desh Raj, Paola García, Sanjeev Khudanpur

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper…

Computation and Language · Computer Science 2018-09-24 Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy,…

Sound · Computer Science 2024-04-30 Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a…

Computation and Language · Computer Science 2022-10-24 Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen