Related papers: Temporal Aware Pruning for Efficient Diffusion-bas…

200 papers

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-27 · Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-15 · Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-13 · Junhao Xia, Chaoyang Zhang, Yecheng Zhang, Chengyang Zhou, Zhichang Wang, Bochun Liu, Dongshuo Yin

Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-09 · Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou

Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-29 · Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-29 · Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-02 · Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

Reconstructing 3D human pose and shape from monocular videos is a well-studied but challenging problem. Common challenges include occlusions, the inherent ambiguities in the 2D to 3D mapping and the computational complexity of video…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-02 · Nikolaos Vasilikopoulos, Nikos Kolotouros, Aggeliki Tsoli, Antonis Argyros

In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-03 · Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun

Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-30 · Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-22 · Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-03 · Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang

Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan

Robotic manipulation with Vision-Language-Action models requires efficient inference over long-horizon multi-modal context, where attention to dense visual tokens dominates computational cost. Existing methods optimize inference speed by…

Robotics · Computer Science · 2025-09-29 · Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, Chang Xu

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-20 · Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-29 · Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-04 · Xinle Cheng, Zhuoming Chen, Zhihao Jia

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-08 · Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-06 · Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar