Related papers: How can objects help action recognition?

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top…

Computer Vision and Pattern Recognition · Computer Science 2024-07-10 Rui Qian, Shuangrui Ding, Dahua Lin

Is an Object-Centric Video Representation Beneficial for Transfer?

The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

The interactions between human and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector, and then are fed to…

Computer Vision and Pattern Recognition · Computer Science 2024-04-19 Xunsong Li, Pengzhan Sun, Yangcen Liu, Lixin Duan, Wen Li

Modelling Spatio-Temporal Interactions For Compositional Action Recognition

Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed. Humans can abstract away the action from the appearance of the objects which is referred to as compositionality…

Computer Vision and Pattern Recognition · Computer Science 2024-10-28 Ramanathan Rajendiran, Debaditya Roy, Basura Fernando

Attend and Interact: Higher-Order Object Interactions for Video Understanding

Human actions often involve complex interactions across several inter-related objects in the scene. However, existing approaches to fine-grained video understanding or visual relationship detection often rely on single object representation…

Computer Vision and Pattern Recognition · Computer Science 2018-03-22 Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

Improving Token-based Object Detection with Video

This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by…

Computer Vision and Pattern Recognition · Computer Science 2025-08-21 Abhineet Singh, Nilanjan Ray

Object Level Visual Reasoning in Videos

Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context. The next open challenges…

Computer Vision and Pattern Recognition · Computer Science 2018-09-21 Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, Greg Mori

Segmenting Moving Objects via an Object-Centric Layered Representation

The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Junyu Xie, Weidi Xie, Andrew Zisserman

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

Principles of Visual Tokens for Efficient Video Understanding

Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Xinyue Hao, Gen Li, Shreyank N Gowda, Robert B Fisher, Jonathan Huang, Anurag Arnab, Laura Sevilla-Lara

Motion Guided Attention Fusion to Recognize Interactions from Videos

We present a dual-pathway approach for recognizing fine-grained interactions from videos. We build on the success of prior dual-stream approaches, but make a distinction between the static and dynamic representations of objects and their…

Computer Vision and Pattern Recognition · Computer Science 2021-04-02 Tae Soo Kim, Jonathan Jones, Gregory D. Hager

Object-Centric Framework for Video Moment Retrieval

Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

Learning to Segment Moving Objects in Videos

We segment moving objects in videos by ranking spatio-temporal segment proposals according to "moving objectness": how likely they are to contain a moving object. In each video frame, we compute segment proposals using multiple…

Computer Vision and Pattern Recognition · Computer Science 2015-05-11 Katerina Fragkiadaki, Pablo Arbelaez, Panna Felsen, Jitendra Malik

Learning Video Object Segmentation from Static Images

Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce video object segmentation problem as a concept of guided instance segmentation. Our model proceeds on a per-frame basis, guided by the…

Computer Vision and Pattern Recognition · Computer Science 2019-02-05 Anna Khoreva, Federico Perazzi, Rodrigo Benenson, Bernt Schiele, Alexander Sorkine-Hornung

Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset

Moments capture a huge part of our lives. Accurate recognition of these moments is challenging due to the diverse and complex interpretation of the moments. Action recognition refers to the act of classifying the desired action/activity…

Computer Vision and Pattern Recognition · Computer Science 2018-09-14 Ankit Shah, Harini Kesavamoorthy, Poorva Rane, Pramati Kalwad, Alexander Hauptmann, Florian Metze

Fine-tuning Image Transformers using Learnable Memory

In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks.…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

Action Recognition using Visual Attention

We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model…

Machine Learning · Computer Science 2016-02-16 Shikhar Sharma, Ryan Kiros, Ruslan Salakhutdinov

Towards Segmenting Anything That Moves

Detecting and segmenting individual objects, regardless of their category, is crucial for many applications such as action detection or robotic interaction. While this problem has been well-studied under the classic formulation of…

Computer Vision and Pattern Recognition · Computer Science 2020-04-02 Achal Dave, Pavel Tokmakov, Deva Ramanan

Long Activity Video Understanding using Functional Object-Oriented Network

Video understanding is one of the most challenging topics in computer vision. In this paper, a four-stage video understanding pipeline is presented to simultaneously recognize all atomic actions and the single on-going activity in a video.…

Computer Vision and Pattern Recognition · Computer Science 2018-07-04 Ahmad Babaeian Jelodar, David Paulius, Yu Sun