Related papers: MLLM-4D: Towards Visual-based Spatial-Temporal Int…

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille

R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob, Patrick Remerscheid, Ken-Joel Simmoteit, Benjamin D. Killeen, Christian Heiliger, Nassir Navab

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

Humans excel at spatial-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly understand the 4D world remains uncertain.…

Computer Vision and Pattern Recognition · Computer Science 2025-04-24 Peiran Wu, Yunze Liu, Miao Liu, Junxiao Shen

S²-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large…

Computer Vision and Pattern Recognition · Computer Science 2025-11-26 Yolo Y. Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in…

Machine Learning · Computer Science 2025-06-04 Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Visual Spatial Tuning

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask:…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang

LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding

Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Hanyu Zhou, Gim Hee Lee