Related papers: Spatial4D-Bench: A Versatile 4D Spatial Intelligen…

200 papers

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks…

Artificial Intelligence · Computer Science 2026-05-08 Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for multimodal large language models (MLLMs).…

Artificial Intelligence · Computer Science 2025-11-21 Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing…

In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show…

Computation and Language · Computer Science 2025-08-28 Chengzu Li, Wenshan Wu, Huanyu Zhang, Qingtao Li, Zeyu Gao, Yan Xia, José Hernández-Orallo, Ivan Vulić, Furu Wei

3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their…

Computer Vision and Pattern Recognition · Computer Science 2025-09-17 Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M de Melo, Alan Yuille

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning on visible visual information, the ability to infer unseen relationships through…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li, Kun Shao, Zheng Tian, Haifeng Zhang, Jun Wang

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical…

Computer Vision and Pattern Recognition · Computer Science 2026-02-19 Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, Zhe Jiang

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world…

Computer Vision and Pattern Recognition · Computer Science 2026-04-15 Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin

Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks…

Computer Vision and Pattern Recognition · Computer Science 2025-10-20 Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

The rapid evolution of large language models (LLMs) holds promise for reforming the methodology of spatio-temporal data mining. However, current works for evaluating the spatio-temporal understanding capability of LLMs are somewhat limited…

Computation and Language · Computer Science 2024-06-28 Wenbin Li, Di Yao, Ruibo Zhao, Wenjie Chen, Zijie Xu, Chengxue Luo, Chang Gong, Quanliang Jing, Haining Tan, Jingping Bi

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-12 Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in…

Machine Learning · Computer Science 2025-06-04 Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception…
