Related papers: Efficient Video Transformers with Spatial-Temporal…

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Text-Video retrieval is a task of great practical value and has received increasing attention, among which learning spatial-temporal video representation is one of the research hotspots. The video encoders in the state-of-the-art video…

Computer Vision and Pattern Recognition · Computer Science 2022-07-19 Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Transformers have become the primary backbone of the computer vision community due to their impressive performance. However, the unfriendly computation cost impedes their potential in the video recognition domain. To optimize the…

Computer Vision and Pattern Recognition · Computer Science 2023-08-10 Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong, Qi Tian

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free…

Computer Vision and Pattern Recognition · Computer Science 2025-07-11 Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning

Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Xiangmo Zhao, Nan Yang, Yang Wang, Zhanwen Liu

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning…

Artificial Intelligence · Computer Science 2026-05-22 Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction…

Computer Vision and Pattern Recognition · Computer Science 2026-05-21 Michal Szczepanski, Martyna Poreba, Karim Haroun

Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Yumeng Shi, Quanyu Long, Wenya Wang

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of…

Computer Vision and Pattern Recognition · Computer Science 2021-11-02 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers

This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token…

Computer Vision and Pattern Recognition · Computer Science 2023-06-06 Chenyang Lu, Daan de Geus, Gijs Dubbelman

Making Vision Transformers Efficient from A Token Sparsification View

The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition

Effective and Efficient spatio-temporal modeling is essential for action recognition. Existing methods suffer from the trade-off between model performance and model complexity. In this paper, we present a novel Spatio-Temporal Hybrid…

Computer Vision and Pattern Recognition · Computer Science 2020-03-19 Xu Li, Jingwen Wang, Lin Ma, Kaihao Zhang, Fengzong Lian, Zhanhui Kang, Jinjun Wang

Vision Transformer with Super Token Sampling

Vision transformer has achieved impressive performance for many vision tasks. However, it may suffer from high redundancy in capturing local features for shallow layers. Local self-attention or early-stage convolutions are thus utilized,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-26 Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, Tieniu Tan

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal…

Computer Vision and Pattern Recognition · Computer Science 2020-01-20 Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these…

Computer Vision and Pattern Recognition · Computer Science 2026-01-15 Junze Shi, Yang Yu, Jian Shi, Haibo Luo

Real-time Online Video Detection with Temporal Smoothing Transformers

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the…

Computer Vision and Pattern Recognition · Computer Science 2022-09-20 Yue Zhao, Philipp Krähenbühl

The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these…

Computer Vision and Pattern Recognition · Computer Science 2025-03-03 Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu

STSM: Spatio-Temporal Shift Module for Efficient Action Recognition

The modeling, computational cost, and accuracy of traditional Spatio-temporal networks are the three most concentrated research topics in video action recognition. The traditional 2D convolution has a low computational cost, but it cannot…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Zhaoqilin Yang, Gaoyun An