Related papers: OLIVE: Object Level In-Context Visual Embeddings

Context Matters: Learning Global Semantics via Object-Centric Representation

Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Jike Zhong, Yuxiang Lai, Xiaofeng Yang, Konstantinos Psounis

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Bastian Pätzold, Jan Nogga, Sven Behnke

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, Seong Jae Hwang

Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee

VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they often fail on tasks that require fine-grained visual perception, even when the required information is still present…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale

Vision-Language Models for Edge Networks: A Comprehensive Survey

Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains…

Computer Vision and Pattern Recognition · Computer Science 2025-06-18 Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding

How Can Objects Help Video-Language Understanding?

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

Object Retrieval for Visual Question Answering with Outside Knowledge

Retrieval-augmented generation (RAG) with large language models (LLMs) plays a crucial role in question answering, as LLMs possess limited knowledge and are not updated with continuously growing information. Most recent work on RAG has…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Shichao Kan, Yuhai Deng, Jiale Fu, Lihui Cen, Zhe Qu, Linna Zhang, Yixiong Liang, Yigang Cen

Visual In-Context Learning for Large Vision-Language Models

In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual…

Computer Vision and Pattern Recognition · Computer Science 2024-02-20 Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen

VinVL: Revisiting Visual Representations in Vision-Language Models

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used…

Computer Vision and Pattern Recognition · Computer Science 2021-03-11 Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Improving Generalization of Language-Conditioned Robot Manipulation

The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of…

Robotics · Computer Science 2025-08-05 Chenglin Cui, Chaoran Zhu, Changjae Oh, Andrea Cavallaro

Latent Implicit Visual Reasoning

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Yicheng Duan, Xi Huang, Duo Chen

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang