Related papers: Exploring Transformer Backbones for Image Diffusio…

200 papers

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-03 · William Peebles, Saining Xie

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-14 · Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Diffusion-based Image Editing has achieved significant success in recent years. However, it remains challenging to achieve high-quality image editing while maintaining the background similarity without sacrificing speed or memory…

Graphics · Computer Science · 2025-09-03 · Siyi Liu, Weiming Chen, Yushun Tang, Zhihai He

Denoising diffusion model, as a generative model, has received a lot of attention in the field of image generation recently, thanks to its powerful generation capability. However, diffusion models have not yet received sufficient research in…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-12 · ZiHan Cao, ShiQi Cao, Xiao Wu, JunMing Hou, Ran Ran, Liang-Jian Deng

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-28 · Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

While diffusion-based generative models have made significant strides in visual content creation, conventional approaches face computational challenges, especially for high-resolution images, as they denoise the entire image from noisy…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-01 · Haohang Xu, Longyu Chen, Yichen Zhang, Shuangrui Ding, Zhipeng Zhang

Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-09 · Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For…

Machine Learning · Computer Science · 2021-06-02 · Prafulla Dhariwal, Alex Nichol

There has been tremendous progress in large-scale text-to-image synthesis driven by diffusion models enabling versatile downstream applications such as 3D object synthesis from texts, image editing, and customized generation. We present a…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-05 · Ting-Hsuan Liao, Songwei Ge, Yiran Xu, Yao-Chih Lee, Badour AlBahar, Jia-Bin Huang

We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-09 · Emaad Khwaja, Abdullah Rashwan, Ting Chen, Oliver Wang, Suraj Kothawade, Yeqing Li

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-08 · Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang

Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-25 · Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, Tim Salimans

Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges while realizing them on-device. Vision Transformers require positional…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-05 · Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-15 · Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-02 · Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Luigi Sigillo, Shengfeng He, Danilo Comminiello

We introduce FLEX (FLow EXpert), a backbone architecture for generative modeling of spatio-temporal physical systems using diffusion models. FLEX operates in the residual space rather than on raw data, a modeling choice that we motivate…

Machine Learning · Computer Science · 2025-05-26 · N. Benjamin Erichson, Vinicius Mikuni, Dongwei Lyu, Yang Gao, Omri Azencot, Soon Hoe Lim, Michael W. Mahoney