Diffusion Language Models are Super Data Learners

Jinjie Ni; Qian Liu; Longxu Dou; Chao Du; Zili Wang; Hang Yan; Tianyu Pang; Michael Qizhe Shieh

Diffusion Language Models are Super Data Learners

Machine Learning 2025-11-06 v1

Authors: Jinjie Ni , Qian Liu , Longxu Dou , Chao Du , Zili Wang , Hang Yan , Tianyu Pang , Michael Qizhe Shieh

Abstract

Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.

Keywords

large language model training instruction tuning deep learning

Cite

@article{arxiv.2511.03276,
  title  = {Diffusion Language Models are Super Data Learners},
  author = {Jinjie Ni and Qian Liu and Longxu Dou and Chao Du and Zili Wang and Hang Yan and Tianyu Pang and Michael Qizhe Shieh},
  journal= {arXiv preprint arXiv:2511.03276},
  year   = {2025}
}

Diffusion Language Models are Super Data Learners

Abstract

Keywords

Cite

Related papers