Related papers: Efficient Spatialtemporal Context Modeling for Act…

Deep Action- and Context-Aware Sequence Learning for Activity Recognition and Anticipation

Action recognition and anticipation are key to the success of many computer vision applications. Existing methods can roughly be grouped into those that extract global, context-aware representations of the entire image or sequence, and…

Computer Vision and Pattern Recognition · Computer Science 2016-11-21 Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Basura Fernando, Mathieu Salzmann, Lars Petersson, Lars Andersson

Spatio-Temporal Context for Action Detection

Research in action detection has grown in recent years, as it plays a key role in video understanding. Modelling the interactions (either spatial or temporal) between actors and their context has proven to be essential for this task.…

Computer Vision and Pattern Recognition · Computer Science 2021-06-30 Manuel Sarmiento Calderó, David Varas, Elisenda Bou-Balust

Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Shahla John

Efficient Modelling Across Time of Human Actions and Interactions

This thesis focuses on video understanding for human action and interaction recognition. We start by identifying the main challenges related to action recognition from videos and review how they have been addressed by current methods. Based…

Computer Vision and Pattern Recognition · Computer Science 2021-10-06 Alexandros Stergiou

LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Cilin Yan, Jingyun Wang, Guoliang Kang

3D Convolutional with Attention for Action Recognition

Human action recognition is one of the challenging tasks in computer vision. The current action recognition methods use computationally expensive models for learning spatio-temporal dependencies of the action. Models utilizing RGB channels…

Computer Vision and Pattern Recognition · Computer Science 2022-06-07 Labina Shrestha, Shikha Dubey, Farrukh Olimov, Muhammad Aasim Rafique, Moongu Jeon

Relational Long Short-Term Memory for Video Action Recognition

Spatial and temporal relationships, both short-range and long-range, between objects in videos, are key cues for recognizing actions. It is a challenging problem to model them jointly. In this paper, we first present a new variant of Long…

Computer Vision and Pattern Recognition · Computer Science 2020-04-28 Zexi Chen, Bharathkumar Ramachandra, Tianfu Wu, Ranga Raju Vatsavai

GTA: Global Temporal Attention for Video Action Understanding

Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We…

Computer Vision and Pattern Recognition · Computer Science 2022-03-30 Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more…

Computer Vision and Pattern Recognition · Computer Science 2023-01-03 Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu

Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is…

Computer Vision and Pattern Recognition · Computer Science 2020-04-06 Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang

Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos

Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Qingsong Fei

Contextual Multi-Scale Region Convolutional 3D Network for Activity Detection

Activity detection is a fundamental problem in computer vision. Detecting activities of different temporal scales is particularly challenging. In this paper, we propose the contextual multi-scale region convolutional 3D network (CMS-RC3D)…

Computer Vision and Pattern Recognition · Computer Science 2018-01-30 Yancheng Bai, Huijuan Xu, Kate Saenko, Bernard Ghanem

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison,…

Computer Vision and Pattern Recognition · Computer Science 2023-10-30 Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

Attentive Action and Context Factorization

We propose a method for human action recognition, one that can localize the spatiotemporal regions that 'define' the actions. This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual…

Computer Vision and Pattern Recognition · Computer Science 2019-04-12 Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks

Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, Long Short-Term Memory (LSTM) networks have shown promising performance in this task due to their strengths in modeling the dependencies…

Computer Vision and Pattern Recognition · Computer Science 2018-02-14 Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, Alex C. Kot

CTM: Collaborative Temporal Modeling for Action Recognition

With the rapid development of digital multimedia, video understanding has become an important field. For action recognition, temporal dimension plays an important role, and this is quite different from image recognition. In order to learn…

Computer Vision and Pattern Recognition · Computer Science 2020-02-11 Qian Liu, Tao Wang, Jie Liu, Yang Guan, Qi Bu, Longfei Yang

On the Importance of Spatial Relations for Few-shot Action Recognition

Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

Interpretable Spatio-temporal Attention for Video Action Recognition

Inspired by the observation that humans are able to process videos efficiently by only paying attention where and when it is needed, we propose an interpretable and easy plug-in spatial-temporal attention mechanism for video action…

Computer Vision and Pattern Recognition · Computer Science 2019-06-04 Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederich Tung, Leonid Sigal

Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition

In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 R. Gnana Praveen, Jahangir Alam

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

There is significant progress in recognizing traditional human activities from videos focusing on highly distinctive actions involving discriminative body movements, body-object and/or human-human interactions. Driver's activities are…

Computer Vision and Pattern Recognition · Computer Science 2021-01-19 Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nik Bessis