Related papers: Improving Joint Speech-Text Representations Withou…

200 papers

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance.…

Computation and Language · Computer Science 2023-05-01 Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we are able to perform a speech-text joint pre-training on unpaired…

Computation and Language · Computer Science 2022-11-01 Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper…

Computation and Language · Computer Science 2018-09-24 Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only…

Sound · Computer Science 2026-05-15 Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech…

Sound · Computer Science 2023-10-10 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech…

Computation and Language · Computer Science 2025-01-10 Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo

Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational…

Computation and Language · Computer Science 2026-04-07 Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module,…

Speech representation learning has improved both speech understanding and speech synthesis tasks for single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method for…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu, Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate…

Computation and Language · Computer Science 2023-06-22 Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-21 Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-12 Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared…

Computation and Language · Computer Science 2023-10-10 Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

A joint speech and text optimization method is proposed for hybrid transducer and attention-based encoder decoder (TAED) modeling to leverage large amounts of text corpus and enhance ASR accuracy. The joint TAED (J-TAED) is trained with…

Computation and Language · Computer Science 2025-06-25 Yun Tang, Eesung Kim, Vijendra Raj Apsingekar

Aligning acoustic and linguistic representations is a central challenge to bridge the pre-trained models in knowledge transfer for automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple…

Computation and Language · Computer Science 2026-03-06 Xugang Lu, Peng Shen, Hisashi Kawai

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The…

Computation and Language · Computer Science 2021-08-30 Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-08 Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg

Continuous sign language recognition (SLR) deals with unaligned video-text pair and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since it is not differentiable, we usually instead optimize the learning…

Computer Vision and Pattern Recognition · Computer Science 2020-10-13 Junfu Pu, Wengang Zhou, Hezhen Hu, Houqiang Li