Related papers: Task-oriented Sequential Grounding and Navigation …

200 papers

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-19 · Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map them to visual elements and actions in the environment. This problem is called…

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-12 · Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-28 · Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given…

Robotics · Computer Science · 2024-11-07 · Haochen Zhang, Nader Zantout, Pujith Kachana, Zongyuan Wu, Ji Zhang, Wenshan Wang

Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Zijun Lin, Shuting He, Cheston Tan, Bihan Wen

We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-12 · Yiming Zhang, ZeMing Gong, Angel X. Chang

A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-23 · Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang

Enabling agents to understand and interact with complex 3D scenes is a fundamental challenge for embodied artificial intelligence systems. While Multimodal Large Language Models (MLLMs) have achieved significant progress in 2D image…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-25 · Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant…

Robotics · Computer Science · 2023-09-29 · Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, Niko Suenderhauf

Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding, which benefits various real-world applications such as robotics and…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-06 · Wencan Huang, Daizong Liu, Wei Hu

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack…

Robotics · Computer Science · 2025-02-14 · Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-19 · Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, Yoav Artzi

Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-19 · Zihan Wang, Seungjun Lee, Gim Hee Lee

3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-09 · Austin T. Wang, ZeMing Gong, Angel X. Chang

3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods…

Graphics · Computer Science · 2025-04-01 · Yuto Asano, Naruya Kondo, Tatsuki Fushimi, Yoichi Ochiai

3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-29 · Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu

Large language models (LLMs) have achieved remarkable success in text-based tasks but often struggle to provide actionable guidance in real-world physical environments. This is because of their inability to recognize their limited…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-05 · Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker