Related papers: VISOR: VIsual Spatial Object Reasoning for Languag…

Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention

We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in photorealistic indoor environments. The task emulates…

Machine Learning · Computer Science 2019-04-09 Khanh Nguyen, Debadeepta Dey, Chris Brockett, Bill Dolan

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural…

Artificial Intelligence · Computer Science 2024-01-25 Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang

Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance

Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation…

Robotics · Computer Science 2025-08-18 Cesar Alan Contreras, Manolis Chiou, Alireza Rastegarpanah, Michal Szulik, Rustam Stolkin

Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning

While Vision-Language Models (VLMs) are set to transform robotic navigation, existing methods often underutilize their reasoning capabilities. To unlock the full potential of VLMs in robotics, we shift their role from passive observers to…

Robotics · Computer Science 2025-11-13 Mobin Habibpour, Fatemeh Afghah

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Multimodal Large Language Models (MLLMs) excel at descriptive tasks within images but often struggle with precise object localization, a critical element for reliable visual interpretation. In contrast, traditional object detection models…

Computer Vision and Pattern Recognition · Computer Science 2024-11-18 Jingru Yang, Huan Yu, Yang Jingxin, Chentianye Xu, Yin Biao, Yu Sun, Shengfeng He

MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Visual Language Models (VLMs),…

Robotics · Computer Science 2025-08-08 Weifan Zhang, Tingguang Li, Yuzhen Liu

Diagnosing Vision-and-Language Navigation: What Really Matters

Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or…

Computer Vision and Pattern Recognition · Computer Science 2022-05-05 Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, William Yang Wang

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination with the only guidance of a natural language instruction. To explore the environment and progress towards the target location, the agent must…

Computer Vision and Pattern Recognition · Computer Science 2019-09-26 Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges

Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational…

Computer Vision and Pattern Recognition · Computer Science 2026-02-02 Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing…

Robotics · Computer Science 2026-05-14 Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for…

Robotics · Computer Science 2026-05-22 Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

Mechanistic interpretability for steering vision-language-action models

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of…

Robotics · Computer Science 2025-09-03 Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates…

Robotics · Computer Science 2026-01-06 Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action…

Robotics · Computer Science 2026-05-28 Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu, Xingzhi Cheng, Zixuan Chen, Haifei Qi, Duo Wang, Hao Xu, Jieqi Shi, Yifan Zhang, Jing Huo, Jian Cheng, Yang Gao, Jiebo Luo

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end…

Robotics · Computer Science 2021-08-27 Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

AgentVLN: Towards Agentic Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding,…

Robotics · Computer Science 2026-03-19 Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang

DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with…

Robotics · Computer Science 2026-01-27 Zijun Li, Shijie Li, Zhenxi Zhang, Bin Li, Shoujun Zhou

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable…

Computer Vision and Pattern Recognition · Computer Science 2023-04-12 Jialu Li, Mohit Bansal