Related papers: Spatial-ViLT: Enhancing Visual Spatial Reasoning t…

Enhancing Spatial Reasoning through Visual and Textual Thinking

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-29 · Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-23 · Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-20 · Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion…

Robotics · Computer Science · 2025-10-16 · Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu

SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving

Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning,…

Machine Learning · Computer Science · 2026-01-27 · Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images,…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-10 · Kanghee Lee, Injae Lee, Minseok Kwak, Jungi Hong, Kwonyoung Ryu, Jaesik Park

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-01 · Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLMs) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-23 · Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT)…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-16 · An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks,…

Robotics · Computer Science · 2025-01-24 · Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-12 · Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu

SpatialBot: Precise Spatial Understanding with Vision Language Models

Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-20 · Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-23 · Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language…

Robotics · Computer Science · 2025-11-03 · Simindokht Jahangard, Mehrzad Mohammadi, Abhinav Dhall, Hamid Rezatofighi

S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-02 · Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-26 · Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris

Learning Multi-View Spatial Reasoning from Cross-View Relations

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Suchae Jeong, Jaehwi Song, Haeone Lee, Hanna Kim, Jian Kim, Dongjun Lee, Dong Kyu Shin, Changyeon Kim, Dongyoon Hahm, Woogyeol Jin, Juheon Choi, Kimin Lee

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-06 · Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-27 · Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Jian Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan