Related papers: Visual Spatial Tuning

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always…

Computation and Language · Computer Science 2024-07-03 Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, Shiyu Chang

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR…

Artificial Intelligence · Computer Science 2025-11-12 Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that…

Computer Vision and Pattern Recognition · Computer Science 2025-04-30 Michael Ogezi, Freda Shi

Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

Enhancing Spatial Reasoning through Visual and Textual Thinking

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

Visual Spatial Reasoning

Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational…

Computation and Language · Computer Science 2023-03-23 Fangyu Liu, Guy Emerson, Nigel Collier

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap:…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning -- a fundamental component of human…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, Neel Joshi

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies…

Computer Vision and Pattern Recognition · Computer Science 2025-02-03 Joshua R. Waite, Md. Zahid Hasan, Qisai Liu, Zhanhong Jiang, Chinmay Hegde, Soumik Sarkar

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning

Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. Many spatial reasoning tasks depend on fundamental two-dimensional…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye