Related papers: SpatialBot: Precise Spatial Understanding with Vis…

SpatialPoint: Spatial-aware Point Prediction for Embodied Localization

Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization -- the problem of predicting executable 3D points conditioned on visual observations and…

Robotics · Computer Science 2026-03-31 Qiming Zhu, Zhirui Fang, Tianming Zhang, Chuanxiu Liu, Xiaoke Jiang, Lei Zhang

Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving…

Artificial Intelligence · Computer Science 2024-06-11 Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei

Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs)…

Robotics · Computer Science 2025-09-03 Jiading Fang

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT)…

Computer Vision and Pattern Recognition · Computer Science 2024-10-16 An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval

Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu

SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks,…

Robotics · Computer Science 2025-01-24 Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang

Embodied Scene Understanding for Vision Language Models via MetaVQA

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making…

Computer Vision and Pattern Recognition · Computer Science 2025-01-17 Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou

The Spatial Blindspot of Vision-Language Models

Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The…

Computer Vision and Pattern Recognition · Computer Science 2026-01-26 Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal,…

Artificial Intelligence · Computer Science 2025-05-21 Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi

SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving

Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning,…

Machine Learning · Computer Science 2026-01-27 Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their…

Computer Vision and Pattern Recognition · Computer Science 2025-07-18 Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao