Related papers: Spatial-ViLT: Enhancing Visual Spatial Reasoning t…

200 papers

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang , Xin Guo , Zhongming Jin , Weihang Pan , Penghui Shang , Deng Cai , Binbin Lin , Jieping Ye

New era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Xiaoyan Wang , Zeju Li , Yifan Xu , Jiaxing Qi , Zhifei Yang , Ruifei Ma , Xiangde Liu , Chao Zhang

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Diankun Wu , Fangfu Liu , Yi-Hsin Hung , Yueqi Duan

Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion…

Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning,…

Machine Learning · Computer Science 2026-01-27 Ashutosh Bajpai , Akshat Bhandari , Akshay Nambi , Tanmoy Chakraborty

The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Kanghee Lee , Injae Lee , Minseok Kwak , Jungi Hong , Kwonyoung Ryu , Jaesik Park

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Dingming Li , Hongxing Li , Zixuan Wang , Yuchen Yan , Hang Zhang , Siqi Chen , Guiyang Hou , Shengpei Jiang , Wenqi Zhang , Yongliang Shen , Weiming Lu , Yueting Zhuang

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Boyuan Chen , Zhuo Xu , Sean Kirmani , Brian Ichter , Danny Driess , Pete Florence , Dorsa Sadigh , Leonidas Guibas , Fei Xia

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT)…

Computer Vision and Pattern Recognition · Computer Science 2024-10-16 An-Chieh Cheng , Hongxu Yin , Yang Fu , Qiushan Guo , Ruihan Yang , Jan Kautz , Xiaolong Wang , Sifei Liu

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks,…

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Fan-Yun Sun , Weiyu Liu , Siyi Gu , Dylan Lim , Goutam Bhat , Federico Tombari , Manling Li , Nick Haber , Jiajun Wu

Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, however they are still struggling with spatial understanding which is the foundation of Embodied AI. In this paper, we propose SpatialBot for…

Computer Vision and Pattern Recognition · Computer Science 2025-03-20 Wenxiao Cai , Iaroslav Ponomarenko , Jianhao Yuan , Xiaoqi Li , Wankou Yang , Hao Dong , Bo Zhao

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu , Jian Guan , Kaituo Feng , Qiang Liu , Shu Wu , Liang Wang , Wei Wu , Tieniu Tan

Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language…

Robotics · Computer Science 2025-11-03 Simindokht Jahangard , Mehrzad Mohammadi , Abhinav Dhall , Hamid Rezatofighi

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Beining Xu , Siting Zhu , Zhao Jin , Junxian Li , Hesheng Wang

Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Ilias Stogiannidis , Steven McDonagh , Sotirios A. Tsaftaris

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Suchae Jeong , Jaehwi Song , Haeone Lee , Hanna Kim , Jian Kim , Dongjun Lee , Dong Kyu Shin , Changyeon Kim , Dongyoon Hahm , Woogyeol Jin , Juheon Choi , Kimin Lee

Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weijian Ma , Shizhao Sun , Tianyu Yu , Ruiyu Wang , Tat-Seng Chua , Jiang Bian

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either…

Computer Vision and Pattern Recognition · Computer Science 2025-10-27 Yang Liu , Ming Ma , Xiaomin Yu , Pengxiang Ding , Han Zhao , Mingyang Sun , Siteng Huang , Donglin Wang

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Jian Zhang , Shijie Zhou , Bangya Liu , Achuta Kadambi , Zhiwen Fan