Related papers: Exploring Transformer Backbones for Image Diffusio…

Scalable Diffusion Models with Transformers

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-03 · William Peebles, Saining Xie

High-Resolution Image Synthesis with Latent Diffusion Models

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-14 · Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

LatentEdit: Adaptive Latent Control for Consistent Semantic Editing

Diffusion-based Image Editing has achieved significant success in recent years. However, it remains challenging to achieve high-quality image editing while maintaining the background similarity without sacrificing speed or memory…

Graphics · Computer Science · 2025-09-03 · Siyi Liu, Weiming Chen, Yushun Tang, Zhihai He

DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion

Denoising diffusion model, as a generative model, has received a lot of attention in the field of image generation recently, thanks to its powerful generation capability. However, diffusion models have not yet received sufficient research in…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-12 · ZiHan Cao, ShiQi Cao, Xiao Wu, JunMing Hou, Ran Ran, Liang-Jian Deng

All are Worth Words: A ViT Backbone for Diffusion Models

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-28 · Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize

While diffusion-based generative models have made significant strides in visual content creation, conventional approaches face computational challenges, especially for high-resolution images, as they denoise the entire image from noisy…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-01 · Haohang Xu, Longyu Chen, Yichen Zhang, Shuangrui Ding, Zhipeng Zhang

Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-09 · Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

Diffusion Models Beat GANs on Image Synthesis

We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For…

Machine Learning · Computer Science · 2021-06-02 · Prafulla Dhariwal, Alex Nichol

Text-driven Visual Synthesis with Latent Diffusion Prior

There has been tremendous progress in large-scale text-to-image synthesis driven by diffusion models enabling versatile downstream applications such as 3D object synthesis from texts, image editing, and customized generation. We present a…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-05 · Ting-Hsuan Liao, Songwei Ge, Yiran Xu, Yao-Chih Lee, Badour AlBahar, Jia-Bin Huang

Layered Diffusion Model for One-Shot High Resolution Text-to-Image Synthesis

We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-09 · Emaad Khwaja, Abdullah Rashwan, Ting Chen, Oliver Wang, Suraj Kothawade, Yeqing Li

DiT4Edit: Diffusion Transformer for Image Editing

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-08 · Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang

Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-25 · Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, Tim Salimans

Hardware-Friendly Diffusion Models with Fixed-Size Reusable Structures for On-Device Image Generation

Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges while realizing them on-device. Vision Transformers require positional…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-05 · Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde

PixelDiT: Pixel Diffusion Transformers for Image Generation

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-15 · Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-02 · Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

A training-free framework for high-fidelity appearance transfer via diffusion transformers

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Luigi Sigillo, Shengfeng He, Danilo Comminiello

FLEX: A Backbone for Diffusion-Based Modeling of Spatio-temporal Physical Systems

We introduce FLEX (FLow EXpert), a backbone architecture for generative modeling of spatio-temporal physical systems using diffusion models. FLEX operates in the residual space rather than on raw data, a modeling choice that we motivate…

Machine Learning · Computer Science · 2025-05-26 · N. Benjamin Erichson, Vinicius Mikuni, Dongwei Lyu, Yang Gao, Omri Azencot, Soon Hoe Lim, Michael W. Mahoney