Related papers: Internal Language Model Estimation Through Explici…

Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A…

Computation and Language · Computer Science 2021-06-18 Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models. In this work, we propose an internal LM…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-05 Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method. In this method, the…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-26 Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing…

Computation and Language · Computer Science 2022-11-01 Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

In-context Language Learning for Endangered Languages in Speech Recognition

With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this…

Computation and Language · Computer Science 2026-01-29 Zhaolin Li, Jan Niehues

Independent language modeling architecture for end-to-end ASR

The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network…

Computation and Language · Computer Science 2019-12-03 Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

Attention-based Contextual Language Model Adaptation for Speech Recognition

Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance level contextual information. For some domains like voice assistants, however, additional context, such as the time at which an utterance…

Computation and Language · Computer Science 2021-06-04 Richard Diehl Martinez, Scott Novotney, Ivan Bulyko, Ariya Rastrow, Andreas Stolcke, Ankur Gandhe

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In…

Computation and Language · Computer Science 2024-01-08 Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into the decoding can further improve the recognition performance of E2E ASR systems. To…

Computation and Language · Computer Science 2022-04-13 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems

Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates…

Computation and Language · Computer Science 2025-06-17 Qingliang Meng, Pengju Ren, Tian Li, Changsong Dai, Huizhi Liang

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text

Integrating audio encoders with LLMs through connectors has enabled these models to process and comprehend audio modalities, significantly enhancing speech-to-text tasks, including automatic speech recognition (ASR) and automatic speech…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-18 Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie

Joint unsupervised and supervised learning for context-aware language identification

Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-17 Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

Deep context: end-to-end contextual speech recognition

In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system…

Audio and Speech Processing · Electrical Eng. & Systems 2018-08-09 Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao

Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words

We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords as prior information in text prompts. We adopt decoder-only architecture and use our in-house LLM,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-14 Kento Nozawa, Takashi Masuko, Toru Taniguchi

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME)…

Audio and Speech Processing · Electrical Eng. & Systems 2022-08-04 Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

On Language Model Integration for RNN Transducer based Speech Recognition

The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of RNN-Transducer (RNN-T) can limit the performance of LM integration such as simple shallow fusion. A Bayesian interpretation suggests to…

Computation and Language · Computer Science 2022-02-17 Wei Zhou, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

End-to-end contextual speech recognition using class language models and a token passing decoder

End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional speech recognition system into a unified model. Although it simplifies training and decoding pipelines, the unified model is hard to…

Computation and Language · Computer Science 2018-12-06 Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen

Multimodal In-context Learning for ASR of Low-resource Languages

Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work…

Computation and Language · Computer Science 2026-04-21 Zhaolin Li, Jan Niehues

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-17 Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

Efficient Long-Form Speech Recognition for General Speech In-Context Learning

We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-01 Hao Yen, Shaoshi Ling, Guoli Ye