Related papers: Diffusion Large Language Models for Visual Speech …

dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition

Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows…

Sound · Computer Science 2026-01-27 Wenjie Tian, Bingshen Mu, Guobin Ma, Xuelong Geng, Zhixian Zhao, Lei Xie

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not…

Computer Vision and Pattern Recognition · Computer Science 2025-06-04 Zehua Liu, Xiaolou Li, Li Guo, Lantian Li, Dong Wang

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-02 Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition

Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches integrating these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from…

Sound · Computer Science 2026-01-21 Rishabh Jain, Naomi Harte

Large Language Models are Strong Audio-Visual Speech Recognition Learners

Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic)…

Computer Vision and Pattern Recognition · Computer Science 2025-03-10 Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language…

Artificial Intelligence · Computer Science 2026-04-08 Keuntae Kim, Mingyu Kang, Yong Suk Choi

Reproducing and Dissecting Denoising Language Models for Speech Recognition

Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to a specific ASR…

Neural and Evolutionary Computing · Computer Science 2025-12-16 Dorian Koch, Albert Zeyer, Nick Rossenbach, Ralf Schlüter, Hermann Ney

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs)…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from…

Computation and Language · Computer Science 2026-02-19 Shuhui Qu

Large Language Model Guided Decoding for Self-Supervised Speech Recognition

Self-supervised automatic speech recognition (SSL-ASR) is an ASR approach that uses speech encoders pretrained on large amounts of unlabeled audio (e.g., wav2vec2.0 or HuBERT) and then fine-tunes them with limited labeled data to perform…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-07 Eyal Cohen, Bhiksha Raj, Joseph Keshet

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are…

Computation and Language · Computer Science 2025-06-27 Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang

DMark: Order-Agnostic Watermarking for Diffusion Large Language Models

Diffusion large language models (dLLMs) offer faster generation than autoregressive models while maintaining comparable quality, but existing watermarking methods fail on them due to their non-sequential decoding. Unlike autoregressive…

Machine Learning · Computer Science 2025-10-06 Linyu Wu, Linhao Zhong, Wenjie Qu, Yuexin Li, Yue Liu, Shengfang Zhai, Chunhua Shen, Jiaheng Zhang

WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering…

Computation and Language · Computer Science 2025-12-30 Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach

Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-15 Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla…

Computation and Language · Computer Science 2025-10-07 Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR)…

Computation and Language · Computer Science 2025-06-04 Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover

Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong

Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this…

Computation and Language · Computer Science 2026-02-26 Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility…

Machine Learning · Computer Science 2026-03-24 Changxiao Cai, Gen Li