Related papers: Do Pre-trained Vision-Language Models Encode Objec…

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Can We Talk Models Into Seeing the World Differently?

Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the…

Computer Vision and Pattern Recognition · Computer Science 2025-03-07 Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, Janis Keuper

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in…

Computation and Language · Computer Science 2023-06-30 Yasmine Karoui, Rémi Lebret, Negar Foroutan, Karl Aberer

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy

Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang

Can Language Models Understand Physical Concepts?

Language models (LMs) gradually become general-purpose interfaces in the interactive and embodied world, where the understanding of physical concepts is an essential prerequisite. However, it is not yet clear whether LMs can understand…

Computation and Language · Computer Science 2023-05-24 Lei Li, Jingjing Xu, Qingxiu Dong, Ce Zheng, Qi Liu, Lingpeng Kong, Xu Sun

Teaching Structured Vision&Language Concepts to Vision&Language Models

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured…

Computer Vision and Pattern Recognition · Computer Science 2023-06-01 Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

How Can Objects Help Video-Language Understanding?

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Ranjan Sapkota, Manoj Karkee

Improving Generalization of Language-Conditioned Robot Manipulation

The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of…

Robotics · Computer Science 2025-08-05 Chenglin Cui, Chaoran Zhu, Changjae Oh, Andrea Cavallaro

Task-oriented Robotic Manipulation with Vision Language Models

Vision Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal…

Robotics · Computer Science 2025-05-21 Nurhan Bulus Guran, Hanchi Ren, Jingjing Deng, Xianghua Xie

Learning Multiple Object States from Actions via Large Language Models

Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an…

Computer Vision and Pattern Recognition · Computer Science 2024-11-08 Masatoshi Tateno, Takuma Yagi, Ryosuke Furuta, Yoichi Sato

Towards Interpreting Visual Information Processing in Vision-Language Models

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-29 Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez

Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Xingxing Weng, Chao Pang, Gui-Song Xia

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Bastian Pätzold, Jan Nogga, Sven Behnke

Physically Grounded Vision-Language Models for Robotic Manipulation

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world,…

Robotics · Computer Science 2024-03-05 Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, Dorsa Sadigh

VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh

How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to urban…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Juneyoung Ro, Namwoo Kim, Yoonjin Yoon

Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting

Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Frank Ruis, Gertjan Burghouts, Hugo Kuijf

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world. However, human perception of reality isn't always faithful to the physical world, a phenomenon known as visual illusions.…

Artificial Intelligence · Computer Science 2023-11-02 Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai