
Related papers: SpatialBot: Precise Spatial Understanding with Vis…

200 papers

Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization -- the problem of predicting executable 3D points conditioned on visual observations and…

Robotics · Computer Science 2026-03-31 Qiming Zhu, Zhirui Fang, Tianming Zhang, Chuanxiu Liu, Xiaoke Jiang, Lei Zhang

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving…

Artificial Intelligence · Computer Science 2024-06-11 Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei

This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs)…

Robotics · Computer Science 2025-09-03 Jiading Fang

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT)…

Computer Vision and Pattern Recognition · Computer Science 2024-10-16 An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of…

Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks,…

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making…

Computer Vision and Pattern Recognition · Computer Science 2025-01-17 Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou

Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The…

Computer Vision and Pattern Recognition · Computer Science 2026-01-26 Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal,…

Artificial Intelligence · Computer Science 2025-05-21 Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi

Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning,…

Machine Learning · Computer Science 2026-01-27 Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their…

Computer Vision and Pattern Recognition · Computer Science 2025-07-18 Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, Bo Zhao