Related papers: Multi-modal Alignment using Representation Codeboo…

200 papers

Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for a learnable downstream task. In this paper, we approach the document classification problem…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-12 · Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science · 2021-03-16 · Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-17 · Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Image clustering, which involves grouping images into different clusters without labels, is a key task in unsupervised learning. Although previous deep clustering methods have achieved remarkable results, they only explore the intrinsic…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-23 · Haixin Zhang, Yongjun Li, Dong Huang

Vision-language models (VLMs) allow texts and images to be embedded in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning there exists a clear separation between the…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-07 · François Role, Sébastien Meyer, Victor Amblard

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-19 · Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-21 · Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura

Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an…

Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-07 · Guanzhou Ke, Yang Yu, Guoqing Chao, Xiaoli Wang, Chenyang Xu, Shengfeng He

Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-11 · Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass

Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-27 · Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality…

Computer Vision and Pattern Recognition · Computer Science · 2022-01-27 · Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu

Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one. Due to the wide range of practical applications, the problem has been mainly focused on the vision and language case, e.g. text…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-30 · Jorge Sánchez, Rodrigo Laguna

Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-23 · Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai

Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-06 · Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-28 · Thomas M. Hehn, Julian F. P. Kooij, Dariu M. Gavrila

Most existing methods in vision-language retrieval match two modalities by either comparing their global feature vectors which misses sufficient information and lacks interpretability, detecting objects in images or videos and aligning the…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-04 · Xiaohan Zou, Changqiao Wu, Lele Cheng, Zhongyuan Wang

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it…

Computer Vision and Pattern Recognition · Computer Science · 2017-10-17 · Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-31 · Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-30 · Mohamed El Banani, Karan Desai, Justin Johnson