Related papers: Task-oriented Sequential Grounding and Navigation …

Grounded 3D-LLM with Referent Tokens

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-19 · Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, Jiangmiao Pang

Gated-Attention Architectures for Task-Oriented Language Grounding

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called…

Machine Learning · Computer Science · 2018-01-10 · Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-12 · Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-28 · Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

With the recent rise of Large Language Models (LLMs), Vision-Language Models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given…

Robotics · Computer Science · 2024-11-07 · Haochen Zhang, Nader Zantout, Pujith Kachana, Zongyuan Wu, Ji Zhang, Wenshan Wang

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Zijun Lin, Shuting He, Cheston Tan, Bihan Wen

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-12 · Yiming Zhang, ZeMing Gong, Angel X. Chang

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-23 · Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang

Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding

Enabling agents to understand and interact with complex 3D scenes is a fundamental challenge for embodied artificial intelligence systems. While Multimodal Large Language Models (MLLMs) have achieved significant progress in 2D image…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-25 · Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant…

Robotics · Computer Science · 2023-09-29 · Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, Niko Suenderhauf

Dense Object Grounding in 3D Scenes

Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding, which benefits various real-world applications such as robotics and…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-06 · Wencan Huang, Daizong Liu, Wei Hu

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack…

Robotics · Computer Science · 2025-02-14 · Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-19 · Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, Yoav Artzi

Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-19 · Zihan Wang, Seungjun Lee, Gim Hee Lee

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-09 · Austin T. Wang, ZeMing Gong, Angel X. Chang

From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts

3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods…

Graphics · Computer Science · 2025-04-01 · Yuto Asano, Naruya Kondo, Tatsuki Fushimi, Yoichi Ochiai

SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-29 · Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu

SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance

Large language models (LLMs) have achieved remarkable success in text-based tasks but often struggle to provide actionable guidance in real-world physical environments. This is because of their inability to recognize their limited…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-05 · Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker