Related papers: Multi-Modality Co-Learning for Efficient Skeleton-…
Skeleton-based action recognition aims to project skeleton sequences to action categories, where skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods that focus on exploring single-form…
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training. Previous research has focused on aligning sequences' visual and semantic spatial…
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet cannot process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for…
Multimodal sentiment analysis (MSA) identifies individuals' sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture…
Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits…
Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and…
Action recognition has been a heated topic in computer vision for its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and RGB video. However, such methods have a…
With the prevalence of RGB-D cameras, multi-modal video data have become more available for human action recognition. One main challenge for this task lies in how to effectively leverage their complementary information. In this work, we…
In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature presentations from a single reconstruction task, may come across the overfitting…
Sign language is commonly used by deaf or speech impaired people to communicate but requires significant effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing signs…
Human action recognition (HAR) with multi-modal inputs (RGB-D, skeleton, point cloud) can achieve high accuracy but typically relies on large labeled datasets and degrades sharply when sensors fail or are noisy. We present Robust…
Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most…
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has…
Skeleton-based action recognition has made great progress recently, but many problems still remain unsolved. For example, most of the previous methods model the representations of skeleton sequences without abundant spatial structure…
Sign language is commonly used by deaf or mute people to communicate but requires extensive effort to master. It is usually performed with the fast yet delicate movement of hand gestures, body posture, and even facial expressions. Current…
Benefiting from its succinctness and robustness, skeleton-based action recognition has recently attracted much attention. Most existing methods utilize local networks (e.g., recurrent, convolutional, and graph convolutional networks) to…
Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models, then integrate the multi-modal information for action understanding by a…
The data-driven approach that learns an optimal representation of vision features like skeleton frames or RGB videos is currently a dominant paradigm for activity recognition. While great improvements have been achieved from existing single…
Recently skeleton-based action recognition has made signif-icant progresses in the computer vision community. Most state-of-the-art algorithms are based on Graph Convolutional Networks (GCN), andtarget at improving the network structure of…
Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle…