Related papers: ConceptAttention: Diffusion Transformers Learn Hig…

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Rare Text Semantics Were Always There in Your Diffusion Transformer

Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the…

Artificial Intelligence · Computer Science 2025-10-07 Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions

Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention…

Computer Vision and Pattern Recognition · Computer Science 2025-05-01 ZiYi Dong, Chengxing Zhou, Weijian Deng, Pengxu Wei, Xiangyang Ji, Liang Lin

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the…

Computer Vision and Pattern Recognition · Computer Science 2024-10-21 Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images, remain…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park

SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Dongting Hu, Aarush Gupta, Magzhan Gabidolla, Arpit Sahni, Huseyin Coskun, Yanyu Li, Yerlan Idelbayev, Ahsan Mahmood, Aleksei Lebedev, Dishani Lahiri, Anujraaj Goyal, Ju Hu, Mingming Gong, Sergey Tulyakov, Anil Kag

EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models

Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale…

Computer Vision and Pattern Recognition · Computer Science 2025-12-08 Kun Wang, Donglin Di, Tonghua Su, Lei Fan

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

Disentangled representation learning strives to extract the intrinsic factors within observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific…

Computer Vision and Pattern Recognition · Computer Science 2024-06-13 Tao Yang, Cuiling Lan, Yan Lu, Nanning Zheng

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

A training-free framework for high-fidelity appearance transfer via diffusion transformers

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan

Making Training-Free Diffusion Segmentors Scale with the Generative Power

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi