Related papers: Predicting Action Tubes
In this thesis, we focus on video action understanding problems from an online and real-time processing point of view. We start with the conversion of the traditional offline spatiotemporal action detection pipeline into an online…
Dominant approaches to action detection can only provide sub-optimal solutions to the problem, as they rely on seeking frame-level detections, to later compose them into "action tubes" in a post-processing step. With this paper we radically…
We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries, more accurately, where temporal contextual evidence is accumulated to…
Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future…
In this paper we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task since it can be valuable for a wide range of interaction applications. To this end we introduce a novel…
Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has been limited due to…
Many video understanding tasks work in the offline setting by assuming that the input video is given from the start to the end. However, many real-world problems require the online setting, making a decision immediately using only the…
We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we…
Current state-of-the-art action detection systems are tailored for offline batch-processing applications. However, for online applications like human-robot interaction, current systems fall short, either because they only detect one action…
In this paper, we address the challenging problem of spatial and temporal action detection in videos. We first develop an effective approach to localize frame-level action regions through integrating static and kinematic information by the…
Object detection and object tracking are usually treated as two separate processes. Significant progress has been made for object detection in 2D images using deep learning networks. The usual tracking-by-detection pipeline for object…
Object detection in video is crucial for many applications. Compared to images, video provides additional cues which can help to disambiguate the detection problem. Our goal in this paper is to learn discriminative models for the temporal…
Analyzing human actions in videos has gained increased attention recently. While most works focus on classifying and labeling observed video frames or anticipating the very recent future, making long-term predictions over more than just a…
We address the problem of action detection in videos. Driven by the latest progress in object detection from 2D images, we build action models using rich feature hierarchies derived from shape and kinematic cues. We incorporate appearance…
We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation, classification and early prediction. Current state-of-the-art approaches work offline and are too slow to be useful in real- world…
In this paper, we propose Two-Stream AMTnet, which leverages recent advances in video-based action representation[1] and incremental action tube generation[2]. Majority of the present action detectors follow a frame-based representation, a…
Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating…
We develop a novel framework for action localization in videos. We propose the Tube Proposal Network (TPN), which can generate generic, class-independent, video-level tubelet proposals in videos. The generated tubelet proposals can be…
In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. This is…
The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown extremely good results for video human action classification, however, action detection is still a challenging problem. The current action detection approaches…