Related papers: Instance-Level Semantic Maps for Vision Language N…

IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation

Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments with human natural language promptings. Recent studies aim to handle this task by constructing the semantic spatial…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Jiacui Huang, Hongtao Zhang, Mingbo Zhao, Zhou Wu

Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps (Nanwani L, Agarwal A, Jain K, et al. Instance-level semantic…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A. H. Abdul Hafez, K. Madhava Krishna

Cross-modal Map Learning for Vision and Language Navigation

We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end using either unstructured memory such as LSTM, or using cross-modal attention over the egocentric observations…

Computer Vision and Pattern Recognition · Computer Science 2022-03-22 Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, Kostas Daniilidis

Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph…

Robotics · Computer Science 2024-04-16 Roberto Bigazzi, Lorenzo Baraldi, Shreyas Kousik, Rita Cucchiara, Marco Pavone

Language and Visual Entity Relationship Graph for Agent Navigation

Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its…

Computer Vision and Pattern Recognition · Computer Science 2020-12-29 Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould

LiLMaps: Learnable Implicit Language Maps

One of the current trends in robotics is to employ large language models (LLMs) to provide non-predefined command execution and natural human-robot interaction. It is useful to have an environment map together with its language…

Robotics · Computer Science 2025-01-09 Evgenii Kruzhkov, Sven Behnke

Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery

To autonomously navigate and plan interactions in real-world environments, robots require the ability to robustly perceive and map complex, unstructured surrounding scenes. Besides building an internal representation of the observed scene…

Robotics · Computer Science 2021-05-18 Margarita Grinvald, Fadri Furrer, Tonci Novkovic, Jen Jen Chung, Cesar Cadena, Roland Siegwart, Juan Nieto

ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve…

Robotics · Computer Science 2024-10-15 Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning…

Robotics · Computer Science 2026-05-01 Teng Wang, Xinxin Zhao, Wenzhe Cai, Changyin Sun

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction. While learning-based approaches have been a major solution to the task, they suffer from high…

Artificial Intelligence · Computer Science 2024-08-13 Zhaohuan Zhan, Lisha Yu, Sijie Yu, Guang Tan

Sign Language: Towards Sign Understanding for Robot Autonomy

Navigational signs are common aids for human wayfinding and scene understanding, but are underutilized by robots. We argue that they benefit robot navigation and scene understanding, by directly encoding privileged information on actions,…

Robotics · Computer Science 2025-09-17 Ayush Agrawal, Joel Loo, Nicky Zimmerman, David Hsu

GridMM: Grid Memory Map for Vision-and-Language Navigation

Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. To represent the previously visited environment, most approaches for VLN implement memory…

Computer Vision and Pattern Recognition · Computer Science 2023-08-25 Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Shuqiang Jiang

LangNav: Language as a Perceptual Representation for Navigation

We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable…

Computer Vision and Pattern Recognition · Computer Science 2023-04-12 Jialu Li, Mohit Bansal

BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-04 Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, Jing Shao

Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention

We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in photorealistic indoor environments. The task emulates…

Machine Learning · Computer Science 2019-04-09 Khanh Nguyen, Debadeepta Dey, Chris Brockett, Bill Dolan

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end…

Robotics · Computer Science 2021-08-27 Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

Diagnosing Vision-and-Language Navigation: What Really Matters

Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or…

Computer Vision and Pattern Recognition · Computer Science 2022-05-05 Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, William Yang Wang

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural…

Artificial Intelligence · Computer Science 2024-01-25 Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang

Follow the Signs: Using Textual Cues and LLMs to Guide Efficient Robot Navigation

Autonomous navigation in unfamiliar environments often relies on geometric mapping and planning strategies that overlook rich semantic cues such as signs, room numbers, and textual labels. We propose a novel semantic navigation framework…

Robotics · Computer Science 2026-01-13 Jing Cao, Nishanth Kumar, Aidan Curtis