Related papers: Spatial4D-Bench: A Versatile 4D Spatial Intelligen…

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks…

Artificial Intelligence · Computer Science 2026-05-08 Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods

Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for multimodal large language models (MLLMs).…

Artificial Intelligence · Computer Science 2025-11-21 Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show…

Computation and Language · Computer Science 2025-08-28 Chengzu Li, Wenshan Wu, Huanyu Zhang, Qingtao Li, Zeyu Gao, Yan Xia, José Hernández-Orallo, Ivan Vulić, Furu Wei

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their…

Computer Vision and Pattern Recognition · Computer Science 2025-09-17 Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M de Melo, Alan Yuille

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs

Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning on visible visual information, the ability to infer unseen relationships through…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li, Kun Shao, Zheng Tian, Haifeng Zhang, Jun Wang

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical…

Computer Vision and Pattern Recognition · Computer Science 2026-02-19 Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, Zhe Jiang

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world…

Computer Vision and Pattern Recognition · Computer Science 2026-04-15 Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin

MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun

Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks…

Computer Vision and Pattern Recognition · Computer Science 2025-10-20 Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

The rapid evolution of large language models (LLMs) holds promise for reforming the methodology of spatio-temporal data mining. However, current works for evaluating the spatio-temporal understanding capability of LLMs are somewhat limited…

Computation and Language · Computer Science 2024-06-28 Wenbin Li, Di Yao, Ruibo Zhao, Wenjie Chen, Zijie Xu, Chengxue Luo, Chang Gong, Quanliang Jing, Haining Tan, Jingping Bi

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-12 Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding the 4D objects (3D…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in…

Machine Learning · Computer Science 2025-06-04 Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng