Related papers: Mutual Modality Learning for Video Action Classifi…
Multimodal video understanding plays a crucial role in tasks such as action recognition and emotion classification by combining information from different modalities. However, multimodal models are prone to overfitting strong modalities,…
Multi-modal machine learning (ML) models can process data in multiple modalities (e.g., video, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding, activity…
Learning multi-modal representations is an essential step towards real-world robotic applications, and various multi-modal fusion models have been developed for this purpose. However, we observe that existing models, whose objectives are…
This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the \acl{ICPR} 2024. The competition aimed to recognize human actions using a…
Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of…
While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the…
Continual learning aims to learn knowledge of tasks observed in sequential time steps while mitigating the forgetting of previously learned knowledge. Existing methods were designed to learn a single modality (e.g., image) over time, which…
Multimodal learning (MML) aims to jointly exploit the common priors of different modalities to compensate for their inherent limitations. However, existing MML methods often optimize a uniform objective for different modalities, leading to…
Ensembles of classifier models typically deliver superior performance and can outperform single classifier models given a dataset and classification task at hand. However, the gain in performance comes together with the lack in…
Learning from multiple modalities, such as audio and video, offers opportunities for leveraging complementary information, enhancing robustness, and improving contextual understanding and performance. However, combining such modalities…
Multi-modal learning has achieved remarkable success by integrating information from various modalities, achieving superior performance in tasks like recognition and retrieval compared to uni-modal approaches. However, real-world scenarios…
Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we…
In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the…
The merit of ensemble learning lies in having different outputs from many individual models on a single input, i.e., the diversity of the base models. The high quality of diversity can be achieved when each model is specialized to different…
We learn about the world from a diverse range of sensory information. Automated systems lack this ability as investigation has centred on processing information presented in a single form. Adapting architectures to learn from multiple…
One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities. However most models often suffer from unsatisfactory multimodal cooperation which cannot jointly utilize all modalities…
Single modality action recognition on RGB or depth sequences has been extensively explored recently. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition.…
In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also global action content. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, to…
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our…
Multimodal deep learning, especially vision-language models, have gained significant traction in recent years, greatly improving performance on many downstream tasks, including content moderation and violence detection. However, standard…