Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

Pan Wang; Yang Liu; Guile Wu; Eduardo R. Corral-Soto; Chengjie Huang; Binbin Xu; Dongfeng Bai; Xu Yan; Yuan Ren; Xingxin Chen; Yizhe Wu; Tao Huang; Wenjun Wan; Xin Wu; Pei Zhou; Xuyang Dai; Kangbo Lv; Hongbo Zhang; Yosef Fried; Aixue Ye; Bailan Feng; Zhenyu Chen; Zhen Li; Yingcong Chen; Yiyi Liao; Bingbing Liu

Computer Vision and Pattern Recognition 2026-03-09 v2

Abstract

4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.

Keywords

visual reasoning; vision-language understanding; large language model evaluation

Cite

@article{arxiv.2601.00092,
  title  = {Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark},
  author = {Pan Wang and Yang Liu and Guile Wu and Eduardo R. Corral-Soto and Chengjie Huang and Binbin Xu and Dongfeng Bai and Xu Yan and Yuan Ren and Xingxin Chen and Yizhe Wu and Tao Huang and Wenjun Wan and Xin Wu and Pei Zhou and Xuyang Dai and Kangbo Lv and Hongbo Zhang and Yosef Fried and Aixue Ye and Bailan Feng and Zhenyu Chen and Zhen Li and Yingcong Chen and Yiyi Liao and Bingbing Liu},
  journal = {arXiv preprint arXiv:2601.00092},
  year   = {2026}
}

Comments

Technical Report
