Related papers: Token Shift Transformer for Video Classification

TSM: Temporal Shift Module for Efficient Video Understanding

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN…

Computer Vision and Pattern Recognition · Computer Science 2019-08-23 Ji Lin, Chuang Gan, Song Han

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device

The explosive growth in video streaming requires video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN-based methods can achieve good…

Computer Vision and Pattern Recognition · Computer Science 2021-09-28 Ji Lin, Chuang Gan, Kuan Wang, Song Han

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Text-Video retrieval is a task of great practical value and has received increasing attention, among which learning spatial-temporal video representation is one of the research hotspots. The video encoders in the state-of-the-art video…

Computer Vision and Pattern Recognition · Computer Science 2022-07-19 Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Feature shifts have been shown to be useful for action recognition with CNN-based models since Temporal Shift Module (TSM) was proposed. It is based on frame-wise feature extraction with late fusion, and layer features are shifted along the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Ryota Hashiguchi, Toru Tamaki

Representation Shift: Unifying Token Compression with FlashAttention

Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim

Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation

We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Louis Fabrice Tshimanga, Andrea Zanola, Federico Del Pup, Manfredo Atzori

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of…

Computer Vision and Pattern Recognition · Computer Science 2021-11-02 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

Space-time Mixing Attention for Video Transformer

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

Temporally Efficient Vision Transformer for Video Instance Segmentation

Recently vision transformer has achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision…

Computer Vision and Pattern Recognition · Computer Science 2022-04-19 Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

Is Space-Time Attention All You Need for Video Understanding?

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal…

Computer Vision and Pattern Recognition · Computer Science 2021-06-10 Gedas Bertasius, Heng Wang, Lorenzo Torresani

TVLT: Textless Vision-Language Transformer

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

Zero-Shot Video Translation via Token Warping

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

TCFormer: Visual Recognition via Token Clustering Transformer

Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

Most existing transformer based video instance segmentation methods extract per frame features independently, hence it is challenging to solve the appearance deformation problem. In this paper, we observe the temporal information is…

Computer Vision and Pattern Recognition · Computer Science 2023-01-24 Zhenghao Zhang, Fangtao Shao, Zuozhuo Dai, Siyu Zhu

TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Zunhui Xia, Hongxing Li, Libin Lan

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally…

Computer Vision and Pattern Recognition · Computer Science 2022-04-22 Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

End-to-End Semantic Video Transformer for Zero-Shot Action Recognition

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model which is…

Computer Vision and Pattern Recognition · Computer Science 2022-12-05 Keval Doshi, Yasin Yilmaz

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance;…

Computer Vision and Pattern Recognition · Computer Science 2020-11-23 Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

The strong demand of autonomous driving in the industry has led to strong interest in 3D object detection and resulted in many excellent 3D object detection algorithms. However, the vast majority of algorithms only model single-frame data,…

Computer Vision and Pattern Recognition · Computer Science 2020-11-30 Zhenxun Yuan, Xiao Song, Lei Bai, Wengang Zhou, Zhe Wang, Wanli Ouyang

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been…

Computer Vision and Pattern Recognition · Computer Science 2022-02-09 Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao