Related papers: Temporal Aware Pruning for Efficient Diffusion-bas…

PARE: Pruning and Adaptive Routing for Efficient Video Generation

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive…

Computer Vision and Pattern Recognition · Computer Science 2026-05-15 Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and…

Computer Vision and Pattern Recognition · Computer Science 2025-05-13 Junhao Xia, Chaoyang Zhang, Yecheng Zhang, Chengyang Zhou, Zhichang Wang, Bochun Liu, Dongshuo Yin

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling…

Computer Vision and Pattern Recognition · Computer Science 2025-06-09 Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou

F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis

Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song

Tango: Taming Visual Signals for Efficient Video Large Language Models

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

TAPE: Temporal Attention-based Probabilistic human pose and shape Estimation

Reconstructing 3D human pose and shape from monocular videos is a well-studied but challenging problem. Common challenges include occlusions, the inherent ambiguities in the 2D to 3D mapping and the computational complexity of video…

Computer Vision and Pattern Recognition · Computer Science 2023-05-02 Nikolaos Vasilikopoulos, Nikos Kolotouros, Aggeliki Tsoli, Antonis Argyros

Token Pruning for In-Context Generation in Diffusion Transformers

In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun

Video Compression Meets Video Generation: Latent Inter-Frame Pruning with Attention Recovery

Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To…

Computer Vision and Pattern Recognition · Computer Science 2026-04-30 Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu

DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan

Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation

Robotic manipulation with Vision-Language-Action models requires efficient inference over long-horizon multi-modal context, where attention to dense visual tokens dominates computational cost. Existing methods optimize inference speed by…

Robotics · Computer Science 2025-09-29 Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, Chang Xu

VidToMe: Video Token Merging for Zero-Shot Video Editing

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models

Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel…

Computer Vision and Pattern Recognition · Computer Science 2025-02-04 Xinle Cheng, Zhuoming Chen, Zhihao Jia

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar