
Related papers: Exploring Spatial Intelligence from a Generative P…

200 papers

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-26 · Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the…

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-24 · Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical…

Artificial Intelligence · Computer Science · 2026-01-15 · Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks…

Artificial Intelligence · Computer Science · 2026-05-08 · Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-22 · Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan

Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for multimodal large language models (MLLMs).…

Artificial Intelligence · Computer Science · 2025-11-21 · Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao

Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable…

Machine Learning · Statistics · 2026-03-30 · Xinyu Tian, Xiaotong Shen

How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry,…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-27 · Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, Hao Frank Yang

We introduce Blueprint-Bench, a benchmark designed to evaluate spatial reasoning capabilities in AI models through the task of converting apartment photographs into accurate 2D floor plans. While the input modality (photographs) is well…

Artificial Intelligence · Computer Science · 2025-10-01 · Lukas Petersson, Axel Backlund, Axel Wennstöm, Hanna Petersson, Callum Sharrock, Arash Dabiri

Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace,…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-09 · Zehan Wang, Jiayang Xu, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao

Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-21 · Yian Li, Yang Jiao, Bin Zhu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang

Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the…

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-18 · Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-15 · Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin

The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-09 · Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing…

Spatial intelligence is important in Architecture, Construction, Science, Technology, Engineering, and Mathematics (STEM), and Medicine. Understanding three-dimensional (3D) spatial rotations can involve verbal descriptions and visual or…

Artificial Intelligence · Computer Science · 2025-03-18 · Uttamasha Monjoree, Wei Yan

Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-17 · Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong