Related papers: Object-and-Action Aware Model for Visual Language …

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-28 · Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra

Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-06 · Vebjørn Haug Kåsene, Pierre Lison

IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation

Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments with human natural language prompts. Recent studies aim to handle this task by constructing the semantic spatial…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-29 · Jiacui Huang, Hongtao Zhang, Mingbo Zhao, Zhou Wu

Multi-modal Discriminative Model for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals.…

Computation and Language · Computer Science · 2019-06-03 · Haoshuo Huang, Vihan Jain, Harsh Mehta, Jason Baldridge, Eugene Ie

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for…

Robotics · Computer Science · 2026-05-22 · Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position. Most existing VLN agents directly learn to align the raw directional…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-15 · Bingqian Lin, Yi Zhu, Xiaodan Liang, Liang Lin, Jianzhuang Liu

Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes

Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction. Unlike existing methods focused on predicting a more accurate action at each step in navigation, in this paper, we make the first…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-08 · Chongyang Zhao, Yuankai Qi, Qi Wu

Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments according to the given language instruction. The main challenges of VLN arise…

Computer Vision and Pattern Recognition · Computer Science · 2020-11-24 · Weixia Zhang, Chao Ma, Qi Wu, Xiaokang Yang

Diagnosing the Environment Bias in Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations. These step-by-step navigational instructions are crucial when the agent…

Computation and Language · Computer Science · 2020-05-08 · Yubo Zhang, Hao Tan, Mohit Bansal

Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-01 · Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang

Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Vision-Language Navigation (VLN) is a task where agents learn to navigate following natural language instructions. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches exploit…

Computer Vision and Pattern Recognition · Computer Science · 2020-04-02 · Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Visual Language Models (VLMs),…

Robotics · Computer Science · 2025-08-08 · Weifan Zhang, Tingguang Li, Yuzhen Liu

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Developing agents capable of navigating to a target location based on language instructions and visual information, known as vision-language navigation (VLN), has attracted widespread interest. Most research has focused on ground-based…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-11 · Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, Si Liu

Vision-Language Navigation: A Survey and Taxonomy

Vision-Language Navigation (VLN) tasks require an agent to follow human language instructions to navigate in previously unseen environments. This challenging field involving problems in natural language processing, computer vision,…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-05 · Wansen Wu, Tao Chang, Xinmeng Li

Active Visual Information Gathering for Vision-Language Navigation

Vision-language navigation (VLN) is the task of entailing an agent to carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to conduct a robust navigation by mitigating the…

Computer Vision and Pattern Recognition · Computer Science · 2020-08-21 · Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, Jianbing Shen

A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-17 · Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural…

Artificial Intelligence · Computer Science · 2024-01-25 · Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-16 · Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu

SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-01 · Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng

AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned…

Robotics · Computer Science · 2026-02-11 · Xiaolou Sun, Wufei Si, Wenhui Ni, Yuntian Li, Dongming Wu, Fei Xie, Runwei Guan, He-Yang Xu, Henghui Ding, Yuan Wu, Yutao Yue, Yongming Huang, Hui Xiong