Related papers: ObjEmbed: Towards Universal Multimodal Object Embe…

200 papers

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across various modalities, such as images and text. However, their interpretability remains a challenge, hindering…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Loris Giulivi, Giacomo Boracchi

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee

There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multimodal models fail to provide satisfactory results in describing occluded objects for…

Computer Vision and Pattern Recognition · Computer Science 2024-10-03 Wenmo Qiu, Xinhan Di

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language…

Computation and Language · Computer Science 2025-10-07 Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared…

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level…

Computation and Language · Computer Science 2021-01-01 Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-level understanding and grounding. In terms…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Timothy Ossowski, Junjie Hu

The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin

This paper introduces embComp, a novel approach for comparing two embeddings that capture the similarity between objects, such as word and document embeddings. We survey scenarios where comparing these embedding spaces is useful. From those…

Human-Computer Interaction · Computer Science 2021-06-03 Florian Heimerl, Christoph Kralj, Torsten Möller, Michael Gleicher

Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities.…

Artificial Intelligence · Computer Science 2026-02-24 Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji