
Related papers: Token Shift Transformer for Video Classification

200 papers

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN…

Computer Vision and Pattern Recognition · Computer Science 2019-08-23 Ji Lin, Chuang Gan, Song Han

The explosive growth in video streaming requires video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN-based methods can achieve good…

Computer Vision and Pattern Recognition · Computer Science 2021-09-28 Ji Lin, Chuang Gan, Kuan Wang, Song Han

Text-Video retrieval is a task of great practical value and has received increasing attention, among which learning spatial-temporal video representation is one of the research hotspots. The video encoders in the state-of-the-art video…

Computer Vision and Pattern Recognition · Computer Science 2022-07-19 Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

Feature shifts have been shown to be useful for action recognition with CNN-based models since Temporal Shift Module (TSM) was proposed. It is based on frame-wise feature extraction with late fusion, and layer features are shifted along the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Ryota Hashiguchi, Toru Tamaki

Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim

We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Louis Fabrice Tshimanga, Andrea Zanola, Federico Del Pup, Manfredo Atzori

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of…

Computer Vision and Pattern Recognition · Computer Science 2021-11-02 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

This paper is on video recognition using Transformers. Very recent attempts in this area have demonstrated promising results in terms of recognition accuracy, yet they have been also shown to induce, in many cases, significant computational…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

Recently vision transformer has achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision…

Computer Vision and Pattern Recognition · Computer Science 2022-04-19 Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal…

Computer Vision and Pattern Recognition · Computer Science 2021-06-10 Gedas Bertasius, Heng Wang, Lorenzo Torresani

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Most existing transformer based video instance segmentation methods extract per frame features independently, hence it is challenging to solve the appearance deformation problem. In this paper, we observe the temporal information is…

Computer Vision and Pattern Recognition · Computer Science 2023-01-24 Zhenghao Zhang, Fangtao Shao, Zuozhuo Dai, Siyu Zhu

In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Zunhui Xia, Hongxing Li, Libin Lan

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally…

Computer Vision and Pattern Recognition · Computer Science 2022-04-22 Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model which is…

Computer Vision and Pattern Recognition · Computer Science 2022-12-05 Keval Doshi, Yasin Yilmaz

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance;…

Computer Vision and Pattern Recognition · Computer Science 2020-11-23 Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

The strong demand for autonomous driving in the industry has led to strong interest in 3D object detection and resulted in many excellent 3D object detection algorithms. However, the vast majority of algorithms only model single-frame data,…

Computer Vision and Pattern Recognition · Computer Science 2020-11-30 Zhenxun Yuan, Xiao Song, Lei Bai, Wengang Zhou, Zhe Wang, Wanli Ouyang

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been…

Computer Vision and Pattern Recognition · Computer Science 2022-02-09 Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao