Related papers: Preserving Modality Structure Improves Multi-Modal…

Continual Learning for Multiple Modalities

Continual learning aims to learn knowledge of tasks observed in sequential time steps while mitigating the forgetting of previously learned knowledge. Existing methods were designed to learn a single modality (e.g., image) over time, which…

Computer Vision and Pattern Recognition · Computer Science 2025-08-15 Hyundong Jin, Eunwoo Kim

Semantic Compression via Multimodal Representation Learning

Multimodal representation learning produces high-dimensional embeddings that align diverse modalities in a shared latent space. While this enables strong generalization, it also introduces scalability challenges, both in terms of storage…

Machine Learning · Computer Science 2025-09-30 Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Shiwon Kim, Yu Rang Park

MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to contents production. Despite the success of separate-encoder approaches like CLIP align…

Computation and Language · Computer Science 2025-10-20 Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji

Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations

A unified representation space in multi-modal learning is essential for effectively integrating diverse data sources, such as text, images, and audio, to enhance efficiency and performance across various downstream tasks. Recent binding…

Machine Learning · Computer Science 2025-10-08 Minoh Jeong, Zae Myung Kim, Min Namgung, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero

Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatically learning similarity information from data via self-expression has…

Machine Learning · Computer Science 2019-03-12 Zhao Kang, Yiwei Lu, Yuanzhang Su, Changsheng Li, Zenglin Xu

Multimodal Self-Supervised Learning for Medical Image Analysis

Self-supervised learning approaches leverage unlabeled samples to acquire generic knowledge about different concepts, hence allowing for annotation-efficient downstream task learning. In this paper, we propose a novel self-supervised method…

Computer Vision and Pattern Recognition · Computer Science 2020-10-27 Aiham Taleb, Christoph Lippert, Tassilo Klein, Moin Nabi

Towards Improving Embedding Based Models of Social Network Alignment via Pseudo Anchors

Social network alignment aims at aligning person identities across social networks. Embedding based models have been shown effective for the alignment where the structural proximity preserving objective is typically adopted for the model…

Social and Information Networks · Computer Science 2021-11-23 Zihan Yan, Li Liu, Xin Li, William K. Cheung, Youmin Zhang, Qun Liu, Guoyin Wang

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper…

Computer Vision and Pattern Recognition · Computer Science 2021-10-18 Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

An Analysis of Semantically-Aligned Speech-Text Embeddings

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

Learning Modality-Invariant Representations for Speech and Images

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the…

Machine Learning · Computer Science 2017-12-12 Kenneth Leidal, David Harwath, James Glass

Deep Multi-Modal Sets

Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby…

Computer Vision and Pattern Recognition · Computer Science 2020-03-04 Austin Reiter, Menglin Jia, Pu Yang, Ser-Nam Lim

SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

We tackle the cross-modal retrieval problem, where learning is only supervised by relevant multi-modal pairs in the data. Although the contrastive learning is the most popular approach for this task, it makes potentially wrong assumption…

Machine Learning · Computer Science 2022-10-13 Minyoung Kim

Can multimodal representation learning by alignment preserve modality-specific information?

Combining multimodal data is a key issue in a wide range of machine learning tasks, including many remote sensing problems. In Earth observation, early multimodal data fusion methods were based on specific neural network architectures and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Romain Thoreau, Jessie Levillain, Dawa Derksen

AnchorNet: A Weakly Supervised Network to Learn Geometry-sensitive Features For Semantic Matching

Despite significant progress of deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HoG. We argue that the strong invariance properties that are key to the success of…

Computer Vision and Pattern Recognition · Computer Science 2017-04-18 David Novotny, Diane Larlus, Andrea Vedaldi

Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation

Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing…

Machine Learning · Computer Science 2025-11-11 Evelyn Chee, Wynne Hsu, Mong Li Lee

Self-Supervised Multimodal Learning: A Survey

Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human…

Machine Learning · Computer Science 2024-08-19 Yongshuo Zong, Oisin Mac Aodha, Timothy Hospedales

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Multi-modal semantic understanding requires integrating information from different modalities to extract users' real intention behind words. Most previous work applies a dual-encoder structure to separately encode image and text, but fails…

Computation and Language · Computer Science 2024-03-12 Ming Zhang, Ke Chang, Yunfang Wu

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real-world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by…

Computer Vision and Pattern Recognition · Computer Science 2021-11-05 Abhinav Valada, Rohit Mohan, Wolfram Burgard

Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies

Learning continuous representations of discrete objects such as text, users, movies, and URLs lies at the heart of many applications including language and user modeling. When using discrete objects as input to neural networks, we often…

Machine Learning · Computer Science 2021-03-12 Paul Pu Liang, Manzil Zaheer, Yuan Wang, Amr Ahmed