Related papers: Context-Aware Sequence Alignment using 4D Skeletal…

Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion

This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Quoc-Huy Tran, Muhammad Ahmed, Murad Popattia, M. Hassan Ahmed, Andrey Konin, M. Zeeshan Zia

Deep-Learning-Assisted Analysis of Cataract Surgery Videos

Following the technological advancements in medicine, the operation rooms are evolving into intelligent environments. The context-aware systems (CAS) can comprehensively interpret the surgical state, enable real-time warning, and support…

Computer Vision and Pattern Recognition · Computer Science 2023-12-12 Negin Ghamsarian

SSAN: Separable Self-Attention Network for Video Representation Learning

Self-attention has been successfully applied to video representation learning due to the effectiveness of modeling long range dependencies. Existing approaches build the dependencies merely by computing the pairwise correlations along…

Computer Vision and Pattern Recognition · Computer Science 2021-05-28 Xudong Guo, Xun Guo, Yan Lu

Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment

Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Keyne Oei, Amr Gomaa, Anna Maria Feit, João Belo

CASA: Category-agnostic Skeletal Animal Reconstruction

Recovering the skeletal shape of an animal from a monocular video is a longstanding challenge. Prevailing animal reconstruction methods often adopt a control-point driven animation model and optimize bone transforms individually without…

Computer Vision and Pattern Recognition · Computer Science 2022-11-08 Yuefan Wu, Zeyuan Chen, Shaowei Liu, Zhongzheng Ren, Shenlong Wang

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations to link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Zihui Xue, Kristen Grauman

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more…

Computer Vision and Pattern Recognition · Computer Science 2023-01-03 Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

Recovering temporally consistent 3D human body pose, shape and motion from a monocular video is a challenging task due to (self-)occlusions, poor lighting conditions, complex articulated body poses, depth ambiguity, and limited availability…

Computer Vision and Pattern Recognition · Computer Science 2023-11-21 Sushovan Chanda, Amogh Tiwari, Lokender Tiwari, Brojeshwar Bhowmick, Avinash Sharma, Hrishav Barua

CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose Estimation

Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best…

Computer Vision and Pattern Recognition · Computer Science 2022-12-12 Niklas Gard, Anna Hilsmann, Peter Eisert

A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video

Spatio-temporal information is key to resolve occlusion and depth ambiguity in 3D pose estimation. Previous methods have focused on either temporal contexts or local-to-global architectures that embed fixed-length spatio-temporal…

Computer Vision and Pattern Recognition · Computer Science 2020-10-21 Junfa Liu, Juan Rojas, Zhijun Liang, Yihui Li, Yisheng Guan

Learning by Aligning Videos in Time

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Muhammad Zeeshan Zia, Quoc-Huy Tran

Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-24 Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

Learning to Align Sequential Actions in the Wild

State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal…

Computer Vision and Pattern Recognition · Computer Science 2021-11-18 Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys

Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos

Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Zhixin Xu, Hengyu Zhou, Yuan Liu, Wenhan Xue, Hao Pan, Wenping Wang, Bin Wang

Deep Action- and Context-Aware Sequence Learning for Activity Recognition and Anticipation

Action recognition and anticipation are key to the success of many computer vision applications. Existing methods can roughly be grouped into those that extract global, context-aware representations of the entire image or sequence, and…

Computer Vision and Pattern Recognition · Computer Science 2016-11-21 Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Basura Fernando, Mathieu Salzmann, Lars Petersson, Lars Andersson

Efficient Modelling Across Time of Human Actions and Interactions

This thesis focuses on video understanding for human action and interaction recognition. We start by identifying the main challenges related to action recognition from videos and review how they have been addressed by current methods. Based…

Computer Vision and Pattern Recognition · Computer Science 2021-10-06 Alexandros Stergiou

On the Importance of Spatial Relations for Few-shot Action Recognition

Deep learning has achieved great success in video recognition, yet still struggles to recognize novel actions when faced with only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos

Taking advantage of human pose data for understanding human activities has attracted much attention these days. However, state-of-the-art pose estimators struggle in obtaining high-quality 2D or 3D pose data due to occlusion, truncation and…

Computer Vision and Pattern Recognition · Computer Science 2020-11-12 Di Yang, Rui Dai, Yaohui Wang, Rupayan Mallick, Luca Minciullo, Gianpiero Francesca, Francois Bremond

Extending Temporal Data Augmentation for Video Action Recognition

Pixel space augmentation has grown in popularity in many Deep Learning areas, due to its effectiveness, simplicity, and low computational cost. Data augmentation for videos, however, still remains an under-explored research topic, as most…

Computer Vision and Pattern Recognition · Computer Science 2022-11-10 Artjoms Gorpincenko, Michal Mackiewicz

CAST: Cross-Attention in Space and Time for Video Action Recognition

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture,…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Dongho Lee, Jongseo Lee, Jinwoo Choi