Related papers: Efficient-vDiT: Efficient Video Diffusion Transfor…

Fast Video Generation with Sliding Tile Attention

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total…

Computer Vision and Pattern Recognition · Computer Science 2025-06-06 Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang

DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck -- attention alone accounts for over…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-04 Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training

Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. However, the quadratic complexity of 3D full attention remains a bottleneck in scaling DiT training, especially with high-definition, lengthy…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-02 Xin Tan, Yuetao Chen, Yimin Jiang, Xing Chen, Kun Yan, Nan Duan, Yibo Zhu, Daxin Jiang, Hong Xu

VSA: Faster Video Diffusion with Trainable Sparse Attention

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient…

Computer Vision and Pattern Recognition · Computer Science 2025-10-29 Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the…

Computer Vision and Pattern Recognition · Computer Science 2024-10-21 Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds

Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and practical on-device generation is even…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov

VORTA: Efficient Video Diffusion via Routing Sparse Attention

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao

Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This…

Computer Vision and Pattern Recognition · Computer Science 2025-04-29 Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

Bidirectional Sparse Attention for Faster Video Diffusion Training

Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

Designing Parameter and Compute Efficient Diffusion Transformers using Distillation

Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models like DALL·E, Stable Diffusion, and Sora. Though these models are necessary in many low-latency applications like…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Vignesh Sundaresha

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Philipp Becker, Abhinav Mehrotra, Ruchika Chavhan, Malcolm Chadwick, Luca Morreale, Mehdi Noroozi, Alberto Gil Ramos, Sourav Bhattacharya

RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80% of the total computational resources. In this work, we introduce RainFusion,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Aiyue Chen, Bin Dong, Jingru Li, Jing Lin, Kun Tian, Yiwu Yao, Gongyi Wang

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang

LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

DiffiT: Diffusion Vision Transformers for Image Generation

Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat