
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

Machine Learning | 2026-05 | v1

Abstract

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x to 8.47x speedups over vanilla decoding while maintaining broadly comparable task accuracy in most settings.
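
The two stages described in the abstract, span formation followed by conflict-aware selection, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the confidence threshold, the dependency score, or the selection rule, so the threshold `tau`, the mean symmetric attention score, the cutoff `delta`, and the greedy ordering below are all assumptions chosen only for illustration, as are the function names.

```python
from typing import List, Sequence, Tuple

Cluster = Tuple[int, int]  # inclusive (start, end) token positions


def confidence_clusters(conf: Sequence[float], masked: Sequence[bool],
                        tau: float = 0.9) -> List[Cluster]:
    """Group adjacent masked positions with confidence >= tau into spans.

    `tau` is a hypothetical threshold, not a value from the paper.
    """
    clusters: List[Cluster] = []
    start = None
    for i, (c, m) in enumerate(zip(conf, masked)):
        if m and c >= tau:
            if start is None:
                start = i          # open a new span
        elif start is not None:
            clusters.append((start, i - 1))  # close the current span
            start = None
    if start is not None:
        clusters.append((start, len(conf) - 1))
    return clusters


def cluster_dependency(attn: Sequence[Sequence[float]],
                       a: Cluster, b: Cluster) -> float:
    """Assumed dependency score: mean attention weight between the two
    spans, averaged over both directions."""
    total, count = 0.0, 0
    for i in range(a[0], a[1] + 1):
        for j in range(b[0], b[1] + 1):
            total += attn[i][j] + attn[j][i]
            count += 2
    return total / count


def select_compatible(clusters: Sequence[Cluster], conf: Sequence[float],
                      attn: Sequence[Sequence[float]],
                      delta: float = 0.2) -> List[Cluster]:
    """Greedy selection (an assumption): visit clusters by descending mean
    confidence and keep one only if its dependency on every cluster kept
    so far is at most `delta`."""
    def mean_conf(c: Cluster) -> float:
        return sum(conf[c[0]:c[1] + 1]) / (c[1] - c[0] + 1)

    chosen: List[Cluster] = []
    for c in sorted(clusters, key=mean_conf, reverse=True):
        if all(cluster_dependency(attn, c, k) <= delta for k in chosen):
            chosen.append(c)
    return sorted(chosen)


# Toy example: six masked positions, with strong attention between 3 and 5.
conf = [0.95, 0.97, 0.30, 0.92, 0.50, 0.99]
masked = [True] * 6
attn = [[0.05] * 6 for _ in range(6)]
attn[3][5], attn[5][3] = 0.8, 0.6

clusters = confidence_clusters(conf, masked)
committed = select_compatible(clusters, conf, attn)
print(clusters)   # [(0, 1), (3, 3), (5, 5)]
print(committed)  # [(0, 1), (5, 5)]
```

In the toy example, positions 3 and 5 attend strongly to each other, so only the more confident of the two is committed in this step, while the weakly coupled span (0, 1) is committed alongside it. In an actual MDLM the confidences and the attention map would come from the same forward pass, as the abstract describes.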

Cite

@article{arxiv.2605.29607,
  title  = {Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models},
  author = {Heqiang Qi and Wei Huang and Mingyuan Bai and Xiangming Meng},
  journal= {arXiv preprint arXiv:2605.29607},
  year   = {2026}
}