Diffusion Language Models for Speech Recognition

Subjects: Computation and Language; Artificial Intelligence; Machine Learning; Neural and Evolutionary Computing
Version: v2, 2026-04-30

Abstract

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their bidirectional attention and parallel text generation. In this work, we explore variants of these models for speech recognition. Specifically, we present a comprehensive guide to using masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) to rescore automatic speech recognition (ASR) hypotheses. Additionally, we design a new joint-decoding method that combines connectionist temporal classification (CTC) with a USDM: at each decoding step, it integrates the framewise probability distributions derived from CTC with the labelwise probability distributions computed by the USDM, thereby generating new candidates that combine the USDM's strong language knowledge with acoustic information from CTC. Our findings show that both USDMs and MDLMs can significantly improve the accuracy of the recognized text. We publish all our code and recipes.
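
To make the rescoring setting concrete, the following is a minimal sketch of generic n-best rescoring with a log-linear score combination. It is not taken from the paper's code: the function names, the `lm_scale` value, and the toy scoring function are all hypothetical. In practice `lm_log_prob` would wrap a diffusion language model and return some log-likelihood estimate for the hypothesis; how the paper computes that score is not stated in the abstract.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hypotheses: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_scale: float = 0.5,  # hypothetical weight, would be tuned on held-out data
) -> List[Tuple[str, float]]:
    """Re-rank (text, asr_log_score) pairs by asr_log_score + lm_scale * lm_log_prob(text)."""
    rescored = [
        (text, asr_score + lm_scale * lm_log_prob(text))
        for text, asr_score in hypotheses
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_lm_log_prob(text: str) -> float:
    """Hypothetical stand-in for a language model: penalise out-of-vocabulary words."""
    vocab = {"the", "cat", "sat", "on", "mat"}
    return -sum(0.0 if word in vocab else 5.0 for word in text.split())


# Toy n-best list: the acoustically preferred hypothesis contains an error.
nbest = [
    ("the cat sad on the mat", -3.0),
    ("the cat sat on the mat", -3.2),
]
best_text, best_score = rescore_nbest(nbest, toy_lm_log_prob)[0]
print(best_text)
```

In this toy example the language model term outweighs the small acoustic advantage of the first hypothesis, so the second hypothesis is ranked first after rescoring.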

Cite

@article{arxiv.2604.14001,
  title  = {Diffusion Language Models for Speech Recognition},
  author = {Davyd Naveriani and Albert Zeyer and Ralf Schlüter and Hermann Ney},
  journal= {arXiv preprint arXiv:2604.14001},
  year   = {2026}
}