Related papers: ObjEmbed: Towards Universal Multimodal Object Embe…

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across various modalities, such as images and text. However, their interpretability remains a challenge, hindering…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Loris Giulivi, Giacomo Boracchi

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi

How Can Objects Help Video-Language Understanding?

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee

OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multimodal models fail to provide satisfactory results in describing occluded objects for…

Computer Vision and Pattern Recognition · Computer Science 2024-10-03 Wenmo Qiu, Xinhan Di

Probing Multimodal Large Language Models for Global and Local Semantic Representations

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

CEMTM: Contextual Embedding-based Multimodal Topic Modeling

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language…

Computation and Language · Computer Science 2025-10-07 Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

MIEB: Massive Image Embedding Benchmark

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared…

Computation and Language · Computer Science 2025-11-04 Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

Contextual Object Detection with Multimodal Large Language Models

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Accurate Word Representations with Universal Visual Guidance

Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level…

Computation and Language · Computer Science 2021-01-01 Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama

OLIVE: Object Level In-Context Visual Embeddings

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Timothy Ossowski, Junjie Hu

RzenEmbed: Towards Comprehensive Multimodal Retrieval

The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin

embComp: Visual Interactive Comparison of Vector Embeddings

This paper introduces embComp, a novel approach for comparing two embeddings that capture the similarity between objects, such as word and document embeddings. We survey scenarios where comparing these embedding spaces is useful. From those…

Human-Computer Interaction · Computer Science 2021-06-03 Florian Heimerl, Christoph Kralj, Torsten Möller, Michael Gleicher

MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

VIRTUE: Visual-Interactive Text-Image Universal Embedder

Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities.…

Artificial Intelligence · Computer Science 2026-02-24 Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji