Related papers: DDiT: Dynamic Patch Scheduling for Efficient Diffu…

DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum

Dynamic Diffusion Transformer

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Guandong Li

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-15 Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

Pyramidal Patchification Flow for Visual Generation

Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch…

Computer Vision and Pattern Recognition · Computer Science 2026-03-13 Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu

Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage…

Computer Vision and Pattern Recognition · Computer Science 2025-05-29 Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno

Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau

Rethinking Token-wise Feature Caching: Accelerating Diffusion Transformers with Dual Feature Caching

Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the…

Machine Learning · Computer Science 2025-11-19 Chang Zou, Evelyn Zhang, Runlin Guo, Haohang Xu, Conghui He, Xuming Hu, Linfeng Zhang

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only…

Computer Vision and Pattern Recognition · Computer Science 2024-11-08 Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, Tian Xie

Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on…

Computer Vision and Pattern Recognition · Computer Science 2026-02-23 Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia

AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers

Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

Accelerating Diffusion Transformers with Token-wise Feature Caching

Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion…

Machine Learning · Computer Science 2025-02-20 Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang

SparseDiT: Token Sparsification for Efficient Diffusion Transformer

Diffusion Transformers (DiT) are renowned for their impressive generative performance; however, they are significantly constrained by considerable computational costs due to the quadratic complexity in self-attention and the extensive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Shuning Chang, Pichao Wang, Jiasheng Tang, Fan Wang, Yi Yang

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Diffusion transformers have demonstrated remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video…

Computer Vision and Pattern Recognition · Computer Science 2025-02-25 Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo

FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we…

Machine Learning · Computer Science 2025-02-28 Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld

Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping

Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Sunghyun Park, Jeongho Kim, Hyoungwoo Park, Debasmit Das, Sungrack Yun, Munawar Hayat, Jaegul Choo, Fatih Porikli, Seokeon Choi

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across…

Computer Vision and Pattern Recognition · Computer Science 2026-05-15 Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, Fatih Porikli

Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Temporal Dynamic Quantization for Diffusion Models

The diffusion model has gained popularity in vision applications due to its remarkable generative performance and versatility. However, high storage and computation demands, resulting from the model size and iterative generation, hinder its…

Computer Vision and Pattern Recognition · Computer Science 2023-12-12 Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, Eunhyeok Park