Related papers: Selective Feature Compression for Efficient Activi…
Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps in action recognition…
Although dense local spatial-temporal features with bag-of-features representation achieve state-of-the-art performance for action recognition, the huge feature number and feature size prevent current methods from scaling up to real size…
Adaptive sampling that exploits the spatiotemporal redundancy in videos is critical for always-on action recognition on wearable devices with limited computing and battery resources. The commonly used fixed sampling strategy is not…
While many action recognition datasets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real-world (e.g., on YouTube) exhibit very different properties: they are often several minutes long,…
Though action recognition in videos has achieved great success recently, it remains a challenging task due to the massive computational cost. Designing lightweight networks is a possible solution, but it may degrade the recognition…
Slow Feature Analysis (SFA) extracts slowly varying features from a quickly varying input signal. It has been successfully applied to modeling the visual receptive fields of the cortical neurons. Sufficient experimental results in…
The existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained from the randomly selected clips and applied to densely sampled clips during testing. However, this…
This paper proposes a novel communication-efficient split learning (SL) framework, named SplitFC, which reduces the communication overhead required for transmitting intermediate feature and gradient vectors during the SL training process.…
Temporal feature extraction is an important issue in video-based action recognition. Optical flow is a popular method to extract temporal feature, which produces excellent performance thanks to its capacity of capturing pixel-level…
In recent years, it has been found that screen content images (SCI) can be effectively compressed based on appropriate probability modelling and suitable entropy coding methods such as arithmetic coding. The key objective is determining the…
For semantic segmentation, most existing real-time deep models trained with each frame independently may produce inconsistent results for a video sequence. Advanced methods take into considerations the correlations in the video sequence,…
We explore the efficiency of the CRF inference beyond image level semantic segmentation and perform joint inference in video frames. The key idea is to combine best of two worlds: semantic co-labeling and more expressive models. Our…
With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference…
Feature coding has been recently considered to facilitate intelligent video analysis for urban computing. Instead of raw videos, extracted features in the front-end are encoded and transmitted to the back-end for further processing. In this…
The task of spatial-temporal action detection has attracted increasing attention among researchers. Existing dominant methods solve this problem by relying on short-term information and dense serial-wise detection on each individual frames…
This paper introduces a novel methodology for Feature Selection for Functional Classification, FSFC, that addresses the challenge of jointly performing feature selection and classification of functional data in scenarios with categorical…
We propose Sequential Feature Filtering Classifier (FFC), a simple but effective classifier for convolutional neural networks (CNNs). With sequential LayerNorm and ReLU, FFC zeroes out low-activation units and preserves high-activation…
Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing…
Inspired by the facts that retinal cells actually segregate the visual scene into different attributes (e.g., spatial details, temporal motion) for respective neuronal processing, we propose to first decompose the input video into…
There has been a dramatic increase in the volume of videos and their related content uploaded to the internet. Accordingly, the need for efficient algorithms to analyse this vast amount of data has attracted significant research interest.…