Related papers: Temporal-Spatial Mapping for Action Recognition

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles…

Computer Vision and Pattern Recognition · Computer Science · 2016-08-03 · Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool

Temporal Segment Networks for Action Recognition in Videos

Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework…

Computer Vision and Pattern Recognition · Computer Science · 2017-05-09 · Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models,…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-13 · Yicheng Qiu, Keiji Yanai

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Despite the success of deep learning for static image understanding, it remains unclear what are the most effective network architectures for the spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or…

Computer Vision and Pattern Recognition · Computer Science · 2018-12-12 · Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Spatio-temporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D).…

Computer Vision and Pattern Recognition · Computer Science · 2019-03-05 · Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu

When Spatial meets Temporal in Action Recognition

Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-26 · Huilin Chen, Lei Wang, Yifan Chen, Tom Gedeon, Piotr Koniusz

Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition

Historically, researchers in the field have spent a great deal of effort to create image representations that have scale invariance and retain spatial location information. This paper proposes to encode equivalent temporal characteristics…

Computer Vision and Pattern Recognition · Computer Science · 2014-09-01 · Zhenzhong Lan, Xuanchong Li, Alexandar G. Hauptmann

CTM: Collaborative Temporal Modeling for Action Recognition

With the rapid development of digital multimedia, video understanding has become an important field. For action recognition, temporal dimension plays an important role, and this is quite different from image recognition. In order to learn…

Computer Vision and Pattern Recognition · Computer Science · 2020-02-11 · Qian Liu, Tao Wang, Jie Liu, Yang Guan, Qi Bu, Longfei Yang

Temporal Bilinear Networks for Video Action Recognition

Temporal modeling in videos is a fundamental yet challenging problem in computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model to capture the temporal pairwise feature interactions between adjacent frames. Compared…

Computer Vision and Pattern Recognition · Computer Science · 2018-11-27 · Yanghao Li, Sijie Song, Yuqi Li, Jiaying Liu

Temporal Visual Semantics-Induced Human Motion Understanding with Large Language Models

Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal…

Machine Learning · Computer Science · 2025-12-30 · Zheng Xing, Weibing Zhao

Motion-driven Visual Tempo Learning for Video-based Action Recognition

Action visual tempo characterizes the dynamics and the temporal scale of an action, which is helpful to distinguish human actions that share high similarities in visual dynamics and appearance. Previous methods capture the visual tempo…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-13 · Yuanzhong Liu, Junsong Yuan, Zhigang Tu

Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation

Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-08 · Xianghan Meng, Zhiyuan Huang, Zhengyu Tong, Chun-Guang Li

Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition

Spatio-temporal convolution often fails to learn motion dynamics in videos and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based…

Computer Vision and Pattern Recognition · Computer Science · 2021-11-03 · Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

STM: SpatioTemporal and Motion Encoding for Action Recognition

Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-19 · Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, Junjie Yan

Exploiting long-term temporal dynamics for video captioning

Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress in this problem has been achieved through two steps: 1) employing 2-D and/or 3-D…

Computer Vision and Pattern Recognition · Computer Science · 2022-02-23 · Yuyu Guo, Jingqiu Zhang, Lianli Gao

MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD

This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long). In prior work, long temporal context is captured by constructing a long-term memory bank consisting…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-12 · Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos

Temporal Action Localization with Multi-temporal Scales

Temporal action localization plays an important role in video analysis, which aims to localize and classify actions in untrimmed videos. The previous methods often predict actions on a feature space of a single-temporal scale. However, the…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-17 · Zan Gao, Xinglei Cui, Tao Zhuo, Zhiyong Cheng, An-An Liu, Meng Wang, Shenyong Chen

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device

The explosive growth in video streaming requires video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN-based methods can achieve good…

Computer Vision and Pattern Recognition · Computer Science · 2021-09-28 · Ji Lin, Chuang Gan, Kuan Wang, Song Han

Exploring Stronger Feature for Temporal Action Localization

Temporal action localization aims to localize starting and ending time with action category. Limited by GPU memory, mainstream methods pre-extract features for each video. Therefore, feature quality determines the upper bound of detection…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-25 · Zhiwu Qing, Xiang Wang, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Changxin Gao, Nong Sang

TDN: Temporal Difference Networks for Efficient Action Recognition

Temporal modeling still remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-02 · Limin Wang, Zhan Tong, Bin Ji, Gangshan Wu