Related papers: VIRTUE: Visual-Interactive Text-Image Universal Em…

200 papers

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-03 · Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-08 · Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP,…

Information Retrieval · Computer Science · 2024-06-07 · Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-21 · Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Text-to-image multimodal tasks, which generate or retrieve an image from a given text description, are extremely challenging because raw text descriptions carry too little information to fully describe visually realistic images.…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-27 · Soyeon Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. The recent vision-based foundation models such as the Segment…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-11 · Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, Liu Ren

Vision-language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-01 · Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-18 · Yale Song, Mohammad Soleymani

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-04 · Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic…

Machine Learning · Computer Science · 2021-02-23 · Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes

Vision-language models (VLMs) make it possible to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning that there is a clear separation between the…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-07 · François Role, Sébastien Meyer, Victor Amblard

Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks by combining visual representations with the abstract skill set large language models (LLMs) learn during pretraining. Vision,…

Artificial Intelligence · Computer Science · 2023-09-01 · Riley Tavassoli, Mani Amani, Reza Akhavian

The visual dialog task attempts to train an agent to answer multi-turn questions given an image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to employ the…

Computation and Language · Computer Science · 2022-02-23 · Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng, Jing Xiao

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of…

Computer Vision and Pattern Recognition · Computer Science · 2020-01-17 · Antoine Miech, Ivan Laptev, Josef Sivic

Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-03 · Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Issar Tzachor, Dvir Samuel, Rami Ben-Ari

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-09 · Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-21 · Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition the way human visual recognition does. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-09 · Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which takes a huge manual effort. In this paper, we propose a framework…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-26 · Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal