Related papers: Multi-Modal Coreference Resolution with the Correl…

Semi-supervised multimodal coreference resolution in image narrations

In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration is paired with an image. This poses significant challenges due to fine-grained image-text alignment, inherent…

Computation and Language · Computer Science 2023-10-23 Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Multimodal deep learning systems which employ multiple modalities like text, image, audio, video, etc., are showing better performance in comparison with individual modalities (i.e., unimodal) systems. Multimodal machine learning involves…

Machine Learning · Computer Science 2022-01-19 Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha

Temporal Cross-Media Retrieval with Soft-Smoothing

Multimedia information have strong temporal correlations that shape the way modalities co-occur over time. In this paper we study the dynamic nature of multimedia and social-media information, where the temporal dimension emerges as a…

Multimedia · Computer Science 2018-10-11 David Semedo, João Magalhães

Learning Shared Cross-modality Representation Using Multispectral-LiDAR and Hyperspectral Data

Due to the ever-growing diversity of the data source, multi-modality feature learning has attracted more and more attention. However, most of these methods are designed by jointly learning feature representation from multi-modalities that…

Computer Vision and Pattern Recognition · Computer Science 2020-06-09 Danfeng Hong, Jocelyn Chanussot, Naoto Yokoya, Jian Kang, Xiao Xiang Zhu

Semi-supervised Classification using Attention-based Regularization on Coarse-resolution Data

Many real-world phenomena are observed at multiple resolutions. Predictive models designed to predict these phenomena typically consider different resolutions separately. This approach might be limiting in applications where predictions are…

Machine Learning · Computer Science 2020-01-07 Guruprasad Nayak, Rahul Ghosh, Xiaowei Jia, Varun Mithal, Vipin Kumar

Cross-Modal Learning via Pairwise Constraints

In multimedia applications, the text and image components in a web document form a pairwise constraint that potentially indicates the same semantic concept. This paper studies cross-modal learning via the pairwise constraint, and aims to…

Computer Vision and Pattern Recognition · Computer Science 2023-07-19 Ran He, Man Zhang, Liang Wang, Ye Ji, Qiyue Yin

Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for…

Computation and Language · Computer Science 2025-06-03 Shun Inadumi, Nobuhiro Ueda, Koichiro Yoshino

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval

The abundance of multimodal data (e.g. social media posts) has inspired interest in cross-modal retrieval methods. Popular approaches rely on a variety of metric learning losses, which prescribe what the proximity of image and text should…

Computer Vision and Pattern Recognition · Computer Science 2020-09-24 Christopher Thomas, Adriana Kovashka

Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated…

Computer Vision and Pattern Recognition · Computer Science 2024-09-05 Xiaogen Zhou, Yiyou Sun, Min Deng, Winnie Chiu Wing Chu, Qi Dou

Simple to Complex Cross-modal Learning to Rank

The heterogeneity-gap between different modalities brings a significant challenge to multimedia information retrieval. Some studies formalize the cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal embedding…

Machine Learning · Computer Science 2017-07-11 Minnan Luo, Xiaojun Chang, Zhihui Li, Liqiang Nie, Alexander G. Hauptmann, Qinghua Zheng

Cross-Modal Coordination Across a Diverse Set of Input Modalities

Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one. Due to the wide range of practical applications, the problem has been mainly focused on the vision and language case, e.g. text…

Computer Vision and Pattern Recognition · Computer Science 2024-01-30 Jorge Sánchez, Rodrigo Laguna

Multimodal Representation Alignment for Cross-modal Information Retrieval

Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding…

Information Retrieval · Computer Science 2025-06-11 Fan Xu, Luis A. Leiva

Learning Unseen Modality Interaction

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

Learning from Multiview Correlations in Open-Domain Videos

An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further…

Machine Learning · Computer Science 2019-03-04 Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation

The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data, which is used to evaluate the correlation between cross-modal data (e.g., CLIP). However, the…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Zhicheng Du, Qingyang Shi, Jiasheng Lu, Yingshan Liang, Xinyu Zhang, Yiran Wang, Peiwu Qin

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, Ranjay Krishna

Learning Discriminative Representations for Semantic Cross Media Retrieval

Heterogeneous gap among different modalities emerges as one of the critical issues in modern AI problems. Unlike traditional uni-modal cases, where raw features are extracted and directly measured, the heterogeneous nature of cross modal…

Information Retrieval · Computer Science 2015-11-19 Aiwen Jiang, Hanxi Li, Yi Li, Mingwen Wang

The cross-media retrieval problem has received much attention in recent years due to the rapid increasing of multimedia data on the Internet. A new approach to the problem has been raised which intends to match features of different…

Multimedia · Computer Science 2015-12-18 Cuicui Kang, Shengcai Liao, Yonghao He, Jian Wang, Wenjia Niu, Shiming Xiang, Chunhong Pan

Learning Multi-modal Similarity

In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits…

Artificial Intelligence · Computer Science 2010-09-01 Brian McFee, Gert Lanckriet

Multimodal Contrastive Training for Visual Representation Learning

We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives. Unlike existing visual pre-training methods, which solve a proxy…

Computer Vision and Pattern Recognition · Computer Science 2021-04-28 Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, Baldo Faieta