Related papers: Scalable Diffusion Models with Transformers

Scalable Diffusion Models with State Space Backbone

This paper presents a new exploration into a category of diffusion models built upon state space architecture. We endeavor to train diffusion models for image data, wherein the traditional U-Net backbone is supplanted by a state space…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Zhengcong Fei , Mingyuan Fan , Changqian Yu , Junshi Huang

TerDiT: Ternary Diffusion Models with Transformers

Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Xudong Lu , Aojun Zhou , Ziyi Lin , Qi Liu , Yuhui Xu , Renrui Zhang , Xue Yang , Junchi Yan , Peng Gao , Hongsheng Li

U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Yuchuan Tian , Zhijun Tu , Hanting Chen , Jie Hu , Chao Xu , Yunhe Wang

Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on…

Computer Vision and Pattern Recognition · Computer Science 2024-12-18 Hao Li , Shamit Lal , Zhiheng Li , Yusheng Xie , Ying Wang , Yang Zou , Orchid Majumder , R. Manmatha , Zhuowen Tu , Stefano Ermon , Stefano Soatto , Ashwin Swaminathan

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Lianghui Zhu , Zilong Huang , Bencheng Liao , Jun Hao Liew , Hanshu Yan , Jiashi Feng , Xinggang Wang

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-13 Moayed Haji-Ali , Willi Menapace , Ivan Skorokhodov , Dogyun Park , Anil Kag , Michael Vasilkovsky , Sergey Tulyakov , Vicente Ordonez , Aliaksandr Siarohin

Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Emiel Hoogeboom , Thomas Mensink , Jonathan Heek , Kay Lamerigts , Ruiqi Gao , Tim Salimans

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Nanye Ma , Mark Goldstein , Michael S. Albergo , Nicholas M. Boffi , Eric Vanden-Eijnden , Saining Xie

On the Scalability of Diffusion-based Text-to-Image Generation

Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for diffusion-based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for…

Computer Vision and Pattern Recognition · Computer Science 2024-04-04 Hao Li , Yang Zou , Ying Wang , Orchid Majumder , Yusheng Xie , R. Manmatha , Ashwin Swaminathan , Zhuowen Tu , Stefano Ermon , Stefano Soatto

Scaling Diffusion Transformers Efficiently via $\mu$P

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was…

Machine Learning · Computer Science 2025-11-03 Chenyu Zheng , Xinyu Zhang , Rongzhen Wang , Wei Huang , Zhi Tian , Weilin Huang , Jun Zhu , Chongxuan Li

Exploring Transformer Backbones for Image Diffusion Models

We present an end-to-end Transformer-based Latent Diffusion model for image synthesis. On the ImageNet class-conditioned generation task, we show that a Transformer-based Latent Diffusion model achieves a 14.1 FID, which is comparable to the…

Computer Vision and Pattern Recognition · Computer Science 2023-01-02 Princy Chahal

PixelDiT: Pixel Diffusion Transformers for Image Generation

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint…

Computer Vision and Pattern Recognition · Computer Science 2026-04-17 Yongsheng Yu , Wei Xiong , Weili Nie , Yichen Sheng , Shiqiu Liu , Jiebo Luo

DiT4Edit: Diffusion Transformer for Image Editing

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture…

Computer Vision and Pattern Recognition · Computer Science 2024-11-08 Kunyu Feng , Yue Ma , Bingyuan Wang , Chenyang Qi , Haozhe Chen , Qifeng Chen , Zeyu Wang

Effective Diffusion Transformer Architecture for Image Super-Resolution

Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Kun Cheng , Lei Yu , Zhijun Tu , Xiao He , Liyu Chen , Yong Guo , Mingrui Zhu , Nannan Wang , Xinbo Gao , Jie Hu

Scaling Laws For Diffusion Transformers

Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Zhengyang Liang , Hao He , Ceyuan Yang , Bo Dai

Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study

The increased model capacity of Diffusion Transformers (DiTs) and the demand for generating higher resolutions of images and videos have led to a significant rise in inference latency, impacting real-time performance adversely. While prior…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Xibo Sun , Jiarui Fang , Aoyu Li , Jinzhe Pan

A training-free framework for high-fidelity appearance transfer via diffusion transformers

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Shengrong Gu , Ye Wang , Song Wu , Rui Ma , Qian Wang , Lanjun Wang , Zili Yi

$\Delta$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Pengtao Chen , Mingzhu Shen , Peng Ye , Jianjian Cao , Chongjun Tu , Christos-Savvas Bouganis , Yiren Zhao , Tao Chen

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Chaofan Gan , Yuanpeng Tu , Xi Chen , Tieyuan Chen , Yuxi Li , Mehrtash Harandi , Weiyao Lin

SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Dongting Hu , Aarush Gupta , Magzhan Gabidolla , Arpit Sahni , Huseyin Coskun , Yanyu Li , Yerlan Idelbayev , Ahsan Mahmood , Aleksei Lebedev , Dishani Lahiri , Anujraaj Goyal , Ju Hu , Mingming Gong , Sergey Tulyakov , Anil Kag