Related papers: MLLM-4D: Towards Visual-based Spatial-Temporal Int…

200 papers

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal…

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille

Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise…

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

Humans excel at spatial-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly understand the 4D world remains uncertain.…

Computer Vision and Pattern Recognition · Computer Science 2025-04-24 Peiran Wu, Yunze Liu, Miao Liu, Junxiao Shen

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large…

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in…

Machine Learning · Computer Science 2025-06-04 Huanyu Zhang, Chengzu Li, Wenshan Wu, Shaoguang Mao, Yifan Zhang, Haochen Tian, Ivan Vulić, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask:…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang

Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Hanyu Zhou, Gim Hee Lee