Related papers: DiT-IC: Aligned Diffusion Transformer for Efficien…

D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-15 · Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-18 · Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin

DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-08 · Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum

DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-24 · Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue

DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT)…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-08 · Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S. Ren, Chun-Le Guo, Chongyi Li

Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-17 · Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng

Dynamic Diffusion Transformer

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, as its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-29 · Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One major efficiency…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-28 · Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-18 · Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Jialing Tong, Xinze Wang, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang

Effective Diffusion Transformer Architecture for Image Super-Resolution

Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-09 · Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-28 · Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang

Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing

Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images.…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-14 · Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

PixelDiT: Pixel Diffusion Transformers for Image Generation

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints

Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical application suffers from inherent dynamic feature instability, leading…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Xiaoye Qu, Tianlong Chen, Yu Cheng

Diffusion Transformers with Representation Autoencoders

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved.…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-14 · Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie