Related papers: Visual Spatial Tuning

200 papers

Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always…

Computation and Language · Computer Science 2024-07-03 Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, Shiyu Chang

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR…

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that…

Computer Vision and Pattern Recognition · Computer Science 2025-04-30 Michael Ogezi, Freda Shi

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational…

Computation and Language · Computer Science 2023-03-23 Fangyu Liu, Guy Emerson, Nigel Collier

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we…

Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap:…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning -- a fundamental component of human…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, Neel Joshi

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies…

Computer Vision and Pattern Recognition · Computer Science 2025-02-03 Joshua R. Waite, Md. Zahid Hasan, Qisai Liu, Zhanhong Jiang, Chinmay Hegde, Soumik Sarkar

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. Many spatial reasoning tasks depend on fundamental two-dimensional…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye