Related papers: Improving Joint Speech-Text Representations Withou…

Understanding Shared Speech-Text Representations

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance.…

Computation and Language · Computer Science 2023-05-01 Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we are able to perform a speech-text joint pre-training on unpaired…

Computation and Language · Computer Science 2022-11-01 Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper…

Computation and Language · Computer Science 2018-09-24 Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only…

Sound · Computer Science 2026-05-15 Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech…

Sound · Computer Science 2023-10-10 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

Towards Unsupervised Speech Recognition Without Pronunciation Models

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech…

Computation and Language · Computer Science 2025-01-10 Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo

Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically

Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational…

Computation and Language · Computer Science 2026-04-07 Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module,…

Computation and Language · Computer Science 2026-04-09 Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Speech representation learning has improved both speech understanding and speech synthesis tasks for single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method for…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu

An Analysis of Semantically-Aligned Speech-Text Embeddings

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

Mixture Encoder for Joint Speech Separation and Recognition

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate…

Computation and Language · Computer Science 2023-06-22 Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-21 Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-12 Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared…

Computation and Language · Computer Science 2023-10-10 Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data

A joint speech and text optimization method is proposed for hybrid transducer and attention-based encoder decoder (TAED) modeling to leverage large amounts of text corpus and enhance ASR accuracy. The joint TAED (J-TAED) is trained with…

Computation and Language · Computer Science 2025-06-25 Yun Tang, Eesung Kim, Vijendra Raj Apsingekar

New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Aligning acoustic and linguistic representations is a central challenge to bridge the pre-trained models in knowledge transfer for automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple…

Computation and Language · Computer Science 2026-03-06 Xugang Lu, Peng Shen, Hisashi Kawai

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

Injecting Text in Self-Supervised Speech Pretraining

Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The…

Computation and Language · Computer Science 2021-08-30 Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-08 Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

Continuous sign language recognition (SLR) deals with unaligned video-text pair and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since it is not differentiable, we usually instead optimize the learning…

Computer Vision and Pattern Recognition · Computer Science 2020-10-13 Junfu Pu, Wengang Zhou, Hezhen Hu, Houqiang Li