Causal Diffusion Transformers for Generative Modeling

Computer Vision and Pattern Recognition 2024-12-18 v2

Abstract

We introduce Causal Diffusion as the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that accommodates both discrete and continuous modalities and is compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enable a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels. It achieves state-of-the-art results on the ImageNet generation benchmark while retaining the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulations. We hope this work provides the community with a fresh perspective on training multimodal models over discrete and continuous data.
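
The dual factorization described in the abstract can be pictured as a two-level loop: an outer loop over sequential (AR) steps and an inner loop over diffusion noise levels. The sketch below is only a toy illustration of that idea, not the authors' implementation. The function name `dual_factorized_schedule`, the splitting of tokens into contiguous, evenly sized chunks, and the integer noise levels are all assumptions made for this example.

```python
def dual_factorized_schedule(num_tokens, ar_steps, diffusion_steps):
    """Toy schedule: list of (ar_step, token_indices, noise_level) triples.

    Hypothetical helper for illustration only. Tokens are split into
    `ar_steps` contiguous chunks (an assumption), and each chunk is
    visited at noise levels diffusion_steps, ..., 1 before moving on.
    """
    if not 1 <= ar_steps <= num_tokens:
        raise ValueError("ar_steps must be between 1 and num_tokens")
    if diffusion_steps < 1:
        raise ValueError("diffusion_steps must be at least 1")

    # Distribute tokens over AR steps as evenly as possible.
    base, extra = divmod(num_tokens, ar_steps)
    schedule = []
    start = 0
    for step in range(ar_steps):
        size = base + (1 if step < extra else 0)
        chunk = list(range(start, start + size))
        start += size
        # Inner axis: denoise this chunk from the highest noise level down.
        for noise_level in range(diffusion_steps, 0, -1):
            schedule.append((step, chunk, noise_level))
    return schedule


if __name__ == "__main__":
    for entry in dual_factorized_schedule(num_tokens=4, ar_steps=2, diffusion_steps=3):
        print(entry)
```

Under these assumptions, setting `ar_steps=1` leaves a single chunk that is denoised over all noise levels (a diffusion-style schedule), while setting `ar_steps=num_tokens` and `diffusion_steps=1` emits one token per step (an AR-style schedule). Intermediate settings sit between the two modes.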

Cite

@article{arxiv.2412.12095,
  title  = {Causal Diffusion Transformers for Generative Modeling},
  author = {Chaorui Deng and Deyao Zhu and Kunchang Li and Shi Guang and Haoqi Fan},
  journal= {arXiv preprint arXiv:2412.12095},
  year   = {2024}
}

Comments

22 figures, 21 pages
