Related papers: Discriminative Video Representation Learning Using…
Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the…
Most popular deep learning based models for action recognition are designed to generate separate predictions within their short temporal windows, which are often aggregated by heuristic means to assign an action label to the full video…
We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient…
In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling temporal encoding method in two ways. First, we present "discriminative rank pooling" in which the…
Adversarial perturbations are noise-like patterns that can subtly change the data, while failing an otherwise accurate classifier. In this paper, we propose to use such perturbations within a novel contrastive learning setup to build…
Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these…
We address an essential problem in computer vision, that of unsupervised object segmentation in video, where a main object of interest in a video sequence should be automatically separated from its background. An efficient solution to this…
We propose a novel pooling strategy that learns how to adaptively rank deep convolutional features for selecting more informative representations. To this end, we exploit discriminative analysis to project the features onto a space spanned…
Deep convolutional networks have recently shown excellent performance on Fine-Grained Vehicle Classification. Based on these existing works, we consider that the back-probation algorithm does not focus on extracting less discriminative…
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted…
Intermediate features at different layers of a deep neural network are known to be discriminative for visual patterns of different complexities. However, most existing works ignore such cross-layer heterogeneities when classifying samples…
We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the…
Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips,…
Most popular deep models for action recognition split video sequences into short sub-sequences consisting of a few frames; frame-based features are then pooled for recognizing the activity. Usually, this pooling step discards the temporal…
We introduce the concept of "dynamic image", a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or…
Feature selection is an essential problem in computer vision, important for category learning and recognition. Along with the rapid development of a wide variety of visual features and classifiers, there is a growing need for efficient…
This paper introduces an unsupervised framework to extract semantically rich features for video representation. Inspired by how the human visual system groups objects based on motion cues, we propose a deep convolutional neural network that…
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels.…
Detecting complex events in a large video collection crawled from video websites is a challenging task. When applying directly good image-based feature representation, e.g., HOG, SIFT, to videos, we have to face the problem of how to pool…
In this paper, we deal with two challenges for measuring the similarity of the subject identities in practical video-based face recognition - the variation of the head pose in uncontrolled environments and the computational expense of…