Related papers: Deep Multi-Modal Sets
Multi-modal learning is a fast growing area in artificial intelligence. It tries to help machines understand complex things by combining information from different sources, like images, text, and audio. By using the strengths of each…
The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform their uni-modal counterparts, since more information is aggregated. Recently, joining the success of deep learning,…
Multiple modalities can provide more valuable information than single one by describing the same contents in various ways. Hence, it is highly expected to learn effective joint representation by fusing the features of different modalities.…
The goal of multi-modal learning is to use complimentary information on the relevant task provided by the multiple modalities to achieve reliable and robust performance. Recently, deep learning has led significant improvement in multi-modal…
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. Firstly, unlike existing multimodal methods that necessitate…
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial…
Developing effective multimodal data fusion strategies has become increasingly essential for improving the predictive power of statistical machine learning methods across a wide range of applications, from autonomous driving to medical…
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.…
Multimodal learning typically relies on the assumption that all modalities are fully available during both the training and inference phases. However, in real-world scenarios, consistently acquiring complete multimodal data presents…
Learning multi-modal representations is an essential step towards real-world robotic applications, and various multi-modal fusion models have been developed for this purpose. However, we observe that existing models, whose objectives are…
Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised…
With the development of web technology, multi-modal or multi-view data has surged as a major stream for big data, where each modal/view encodes individual property of data objects. Often, different modalities are complementary to each…
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that…
Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have been proven successful,…
Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while…
In this thesis, we develop various techniques for working with sets in machine learning. Each input or output is not an image or a sequence, but a set: an unordered collection of multiple objects, each object described by a feature vector.…
Materials science datasets are inherently heterogeneous and are available in different modalities such as characterization spectra, atomic structures, microscopic images, and text-based synthesis conditions. The advancements in multi-modal…
This paper proposes a learning model, based on rank-fusion graphs, for general applicability in multimodal prediction tasks, such as multimodal regression and image classification. Rank-fusion graphs encode information from multiple…
Multimodality Representation Learning, as a technique of learning to embed information from different modalities and their correlations, has achieved remarkable success on a variety of applications, such as Visual Question Answering (VQA),…
Continual learning aims to learn knowledge of tasks observed in sequential time steps while mitigating the forgetting of previously learned knowledge. Existing methods were designed to learn a single modality (e.g., image) over time, which…