Related papers: DiffiT: Diffusion Vision Transformers for Image Ge…

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-15 · Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model

Denoising Diffusion Probabilistic Models (DDPM) and Vision Transformer (ViT) have demonstrated significant progress in generative tasks and discriminative tasks, respectively, and thus far these models have largely been developed in their own…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-17 · Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, Shihao Ji

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

Dynamic Diffusion Transformer

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

FiT: Flexible Vision Transformer for Diffusion Model

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-16 · Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai

LaVin-DiT: Large Vision Diffusion Transformer

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-07 · Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-26 · Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen

xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Diffusion models are pivotal for generating high-quality images and videos. Inspired by the success of OpenAI's Sora, the backbone of diffusion models is evolving from U-Net to Transformer, known as Diffusion Transformers (DiTs). However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-05 · Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, Jiannan Wang

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-20 · Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller

DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

In autonomous driving, deep models have shown remarkable performance across various visual perception tasks with the demand of high-quality and huge-diversity training datasets. Such datasets are expected to cover various driving scenarios…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-23 · Jiahang Tu, Wei Ji, Hanbin Zhao, Chao Zhang, Roger Zimmermann, Hui Qian

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory when generating ultra-high-resolution images (e.g., 4096×4096), the resolution of generated images is…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-09 · Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, given its simplicity, parallelism, and efficiency for image generation. Through detailed exploration, we offer a suite of ready-to-use…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-29 · Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo

PixelDiT: Pixel Diffusion Transformers for Image Generation

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

SHIFT: Steering Hidden Intermediates in Flow Transformers

Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-13 · Nina Konovalova, Andrey Kuznetsov, Aibek Alanov

Exploring Vision Transformers for Fine-grained Classification

Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most…

Computer Vision and Pattern Recognition · Computer Science · 2021-07-01 · Marcos V. Conde, Kerem Turgutlu

TerDiT: Ternary Diffusion Models with Transformers

Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Xudong Lu, Aojun Zhou, Ziyi Lin, Qi Liu, Yuhui Xu, Renrui Zhang, Xue Yang, Junchi Yan, Peng Gao, Hongsheng Li

PViT: Prior-augmented Vision Transformer for Out-of-distribution Detection

Vision Transformers (ViTs) have achieved remarkable success over various vision tasks, yet their robustness against data distribution shifts and inherent inductive biases remain underexplored. To enhance the robustness of ViT models for…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-15 · Tianhao Zhang, Zhixiang Chen, Lyudmila S. Mihaylova

DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space

This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-02 · Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Wenshuo Chen, Albert Ali Salah, Itir Onal Ertugrul

Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-23 · Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia