Related papers: Supervised cross-modal factor analysis for multipl…
Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream task. In this paper, we approach the document classification problem…
Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint…
In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each…
Multi-modal semantic understanding requires integrating information from different modalities to extract users' real intention behind words. Most previous work applies a dual-encoder structure to separately encode image and text, but fails…
Integrating various data modalities brings valuable insights into underlying phenomena. Multimodal factor analysis (FA) uncovers shared axes of variation underlying different simple data modalities, where each sample is represented by a…
Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information,…
Aggregating multi-modality data to obtain reliable data representation attracts more and more attention. Recent studies demonstrate that Transformer models usually work well for multi-modality tasks. Existing Transformers generally either…
Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods. Visual and semantic…
Concept Factorization (CF), as a novel paradigm of representation learning, has demonstrated superior performance in multi-view clustering tasks. It overcomes limitations such as the non-negativity constraint imposed by traditional matrix…
Fusing multi-modality information is known to be able to effectively bring significant improvement in video classification. However, the most popular method up to now is still simply fusing each stream's prediction scores at the last stage.…
Modern Web systems such as social media and e-commerce contain rich contents expressed in images and text. Leveraging information from multi-modalities can improve the performance of machine learning tasks such as classification and…
Most of the existing self-supervised feature learning methods for 3D data either learn 3D features from point cloud data or from multi-view images. By exploring the inherent multi-modality attributes of 3D objects, in this paper, we propose…
In this paper, we propose a novel semi-supervised feature selection framework by mining correlations among multiple tasks and apply it to different multimedia applications. Instead of independently computing the importance of features for…
The success of supervised learning requires large-scale ground truth labels which are very expensive, time-consuming, or may need special skills to annotate. To address this issue, many self- or un-supervised methods are developed. Unlike…
Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced…
With the rapid development of multimedia, the shift from unimodal textual sentiment analysis to multimodal image-text sentiment analysis has obtained academic and industrial attention in recent years. However, multimodal sentiment analysis…
Canonical Correlation Analysis (CCA) is widely used for multimodal data analysis and, more recently, for discriminative tasks such as multi-view learning; however, it makes no use of class labels. Recent CCA methods have started to address…
Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders…
Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual…
Modality representation learning is an important problem for multimodal sentiment analysis (MSA), since the highly distinguishable representations can contribute to improving the analysis effect. Previous works of MSA have usually focused…