Related papers: ConceptAttention: Diffusion Transformers Learn Hig…

200 papers

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the…

Artificial Intelligence · Computer Science 2025-10-07 Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention…

Computer Vision and Pattern Recognition · Computer Science 2025-05-01 ZiYi Dong, Chengxing Zhou, Weijian Deng, Pengxu Wei, Xiangyang Ji, Liang Lin

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the…

Computer Vision and Pattern Recognition · Computer Science 2024-10-21 Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images, remain…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park

Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT…

Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale…

Computer Vision and Pattern Recognition · Computer Science 2025-12-08 Kun Wang, Donglin Di, Tonghua Su, Lei Fan

Disentangled representation learning strives to extract the intrinsic factors within observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific…

Computer Vision and Pattern Recognition · Computer Science 2024-06-13 Tao Yang, Cuiling Lan, Yan Lu, Nanning Zheng

Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

Recent advancements in diffusion models have notably improved the perceptual quality of generated images in text-to-image synthesis tasks. However, diffusion models often struggle to produce images that accurately reflect the intended…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi