Related papers: Understanding Shared Speech-Text Representations

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science 2019-12-17 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

Improving Joint Speech-Text Representations Without Alignment

The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as…

Computation and Language · Computer Science 2023-08-14 Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared…

Computation and Language · Computer Science 2023-10-10 Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

MAESTRO: Matched Speech Text Representations through Modality Matching

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while…

Computation and Language · Computer Science 2022-07-05 Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation

End-to-end transformer-based automatic speech recognition (ASR) systems often capture multiple speech traits in their learned representations that are highly entangled, leading to a lack of interpretability. In this study, we propose the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-28 Pu Wang, Hugo Van hamme

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech…

Sound · Computer Science 2023-10-10 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-14 Pavel Denisov, Ngoc Thang Vu

Bridging the Modality Gap for Speech-to-Text Translation

End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and…

Computation and Language · Computer Science 2020-10-29 Yuchen Liu, Junnan Zhu, Jiajun Zhang, Chengqing Zong

AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in…

Computation and Language · Computer Science 2025-03-19 Wuwei Huang, Dexin Wang, Deyi Xiong

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

Learning Shared Encoding Representation for End-to-End Speech Recognition Models

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that…

Audio and Speech Processing · Electrical Eng. & Systems 2019-04-04 Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

Mixture Encoder for Joint Speech Separation and Recognition

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate…

Computation and Language · Computer Science 2023-06-22 Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

Text-Utilization for Encoder-dominated Speech Recognition Models

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate…

Computation and Language · Computer Science 2026-04-30 Albert Zeyer, Tim Posielek, Ralf Schlüter, Hermann Ney

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-02 Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li

An Analysis of Semantically-Aligned Speech-Text Embeddings

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual…

Computation and Language · Computer Science 2021-04-20 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-02 Zili Huang, Desh Raj, Paola García, Sanjeev Khudanpur

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper…

Computation and Language · Computer Science 2018-09-24 Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy,…

Sound · Computer Science 2024-04-30 Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a…

Computation and Language · Computer Science 2022-10-24 Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen