Related papers: Efficient Video Transformers with Spatial-Temporal…

200 papers

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-19 · Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee

Text-video retrieval is a task of great practical value that has received increasing attention, and learning spatial-temporal video representations is one of its research hotspots. The video encoders in the state-of-the-art video…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-19 · Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

Transformers have become the primary backbone of the computer vision community due to their impressive performance. However, the unfriendly computation cost impedes their potential in the video recognition domain. To optimize the…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-10 · Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong, Qi Tian

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-11 · Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-29 · Xiangmo Zhao, Nan Yang, Yang Wang, Zhanwen Liu

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-25 · Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning…

Artificial Intelligence · Computer Science · 2026-05-22 · Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-21 · Michal Szczepanski, Martyna Poreba, Karim Haroun

Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-16 · Yumeng Shi, Quanyu Long, Wenya Wang

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of…

Computer Vision and Pattern Recognition · Computer Science · 2021-11-02 · Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-06 · Chenyang Lu, Daan de Geus, Gijs Dubbelman

The quadratic computational complexity with respect to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-31 · Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

Effective and efficient spatio-temporal modeling is essential for action recognition. Existing methods suffer from the trade-off between model performance and model complexity. In this paper, we present a novel Spatio-Temporal Hybrid…

Computer Vision and Pattern Recognition · Computer Science · 2020-03-19 · Xu Li, Jingwen Wang, Lin Ma, Kaihao Zhang, Fengzong Lian, Zhanhui Kang, Jinjun Wang

Vision transformers have achieved impressive performance on many vision tasks. However, they may suffer from high redundancy when capturing local features in shallow layers. Local self-attention or early-stage convolutions are thus utilized,…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-26 · Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, Tieniu Tan

Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal…

Computer Vision and Pattern Recognition · Computer Science · 2020-01-20 · Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-17 · Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-15 · Junze Shi, Yang Yu, Jian Shi, Haibo Luo

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-20 · Yue Zhao, Philipp Krähenbühl

Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-03 · Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu

The modeling, computational cost, and accuracy of traditional spatio-temporal networks are the three most actively studied topics in video action recognition. Traditional 2D convolution has a low computational cost, but it cannot…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-07 · Zhaoqilin Yang, Gaoyun An