Related papers: VIRTUE: Visual-Interactive Text-Image Universal Em…

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP,…

Information Retrieval · Computer Science 2024-06-07 Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong

UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Text-to-image multimodal tasks, generating/retrieving an image from a given text description, are extremely challenging tasks since raw text descriptions cover quite limited information in order to fully describe visually realistic images.…

Computer Vision and Pattern Recognition · Computer Science 2020-10-27 Soyeon Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. The recent vision-based foundation models such as the Segment…

Computer Vision and Pattern Recognition · Computer Science 2024-06-11 Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, Liu Ren

VicTR: Video-conditioned Text Representations for Activity Recognition

Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However for videos, such paired data is not as…

Computer Vision and Pattern Recognition · Computer Science 2024-04-01 Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared…

Computer Vision and Pattern Recognition · Computer Science 2019-07-18 Yale Song, Mohammad Soleymani

ObjEmbed: Towards Universal Multimodal Object Embeddings

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic…

Machine Learning · Computer Science 2021-02-23 Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes

Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

Vision-language models (VLMs) allow to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon meaning there exists a clear separation between the…

Computer Vision and Pattern Recognition · Computer Science 2025-05-07 François Role, Sébastien Meyer, Victor Amblard

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision,…

Artificial Intelligence · Computer Science 2023-09-01 Riley Tavassoli, Mani Amani, Reza Akhavian

VU-BERT: A Unified framework for Visual Dialog

The visual dialog task attempts to train an agent to answer multi-turn questions given an image, which requires the deep understanding of interactions between the image and dialog history. Existing researches tend to employ the…

Computation and Language · Computer Science 2022-02-23 Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng, Jing Xiao

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of…

Computer Vision and Pattern Recognition · Computer Science 2020-01-17 Antoine Miech, Ivan Laptev, Josef Sivic

Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li

VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Issar Tzachor, Dvir Samuel, Rami Ben-Ari

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded…

Computer Vision and Pattern Recognition · Computer Science 2025-04-21 Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

MuMUR : Multilingual Multimodal Universal Retrieval

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models require additional labelled data which is a huge manual effort. In this paper, we propose a framework…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal