Related papers: Multi-modal Alignment using Representation Codeboo…

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels

Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-17 Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Dual-Level Cross-Modal Contrastive Clustering

Image clustering, which involves grouping images into different clusters without labels, is a key task in unsupervised learning. Although previous deep clustering methods have achieved remarkable results, they only explore the intrinsic…

Computer Vision and Pattern Recognition · Computer Science 2024-09-23 Haixin Zhang, Yongjun Li, Dong Huang

Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

Vision-language models (VLMs) make it possible to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning there exists a clear separation between the…

Computer Vision and Pattern Recognition · Computer Science 2025-05-07 François Role, Sébastien Meyer, Victor Amblard

Diversifying Joint Vision-Language Tokenization Learning

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from…

Computer Vision and Pattern Recognition · Computer Science 2023-06-19 Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Image Captioning with Visual Object Representations Grounded in the Textual Modality

We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual…

Computer Vision and Pattern Recognition · Computer Science 2020-10-21 Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura

Do Vision and Language Encoders Represent the World Similarly?

Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor

Disentangling Multi-view Representations Beyond Inductive Bias

Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by…

Computer Vision and Pattern Recognition · Computer Science 2023-08-07 Guanzhou Ke, Yang Yu, Guoqing Chao, Xiaoli Wang, Chenyang Xu, Shengfeng He

Cross-Modal Discrete Representation Learning

Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised…

Computer Vision and Pattern Recognition · Computer Science 2021-06-11 Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality…

Computer Vision and Pattern Recognition · Computer Science 2022-01-27 Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu

Cross-Modal Coordination Across a Diverse Set of Input Modalities

Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one. Due to the wide range of practical applications, the problem has been mainly focused on the vision and language case, e.g. text…

Computer Vision and Pattern Recognition · Computer Science 2024-01-30 Jorge Sánchez, Rodrigo Laguna

Multi-level Cross-modal Alignment for Image Clustering

Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?

Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how…

Computer Vision and Pattern Recognition · Computer Science 2022-11-28 Thomas M. Hehn, Julian F. P. Kooij, Dariu M. Gavrila

TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval

Most existing methods in vision-language retrieval match two modalities by either comparing their global feature vectors which misses sufficient information and lacks interpretability, detecting objects in images or videos and aligning the…

Computer Vision and Pattern Recognition · Computer Science 2022-10-04 Xiaohan Zou, Changqiao Wu, Lele Cheng, Zhongyuan Wang

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it…

Computer Vision and Pattern Recognition · Computer Science 2017-10-17 Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Mohamed El Banani, Karan Desai, Justin Johnson