Related papers: VISOR: VIsual Spatial Object Reasoning for Languag…

200 papers

We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in photorealistic indoor environments. The task emulates…

Machine Learning · Computer Science · 2019-04-09 · Khanh Nguyen, Debadeepta Dey, Chris Brockett, Bill Dolan

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural…

Artificial Intelligence · Computer Science · 2024-01-25 · Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang

Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation…

Robotics · Computer Science · 2025-08-18 · Cesar Alan Contreras, Manolis Chiou, Alireza Rastegarpanah, Michal Szulik, Rustam Stolkin

While Vision-Language Models (VLMs) are set to transform robotic navigation, existing methods often underutilize their reasoning capabilities. To unlock the full potential of VLMs in robotics, we shift their role from passive observers to…

Robotics · Computer Science · 2025-11-13 · Mobin Habibpour, Fatemeh Afghah

Multimodal Large Language Models (MLLMs) excel at descriptive tasks within images but often struggle with precise object localization, a critical element for reliable visual interpretation. In contrast, traditional object detection models…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-18 · Jingru Yang, Huan Yu, Yang Jingxin, Chentianye Xu, Yin Biao, Yu Sun, Shengfeng He

Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Visual Language Models (VLMs),…

Robotics · Computer Science · 2025-08-08 · Weifan Zhang, Tingguang Li, Yuzhen Liu

Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-05 · Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, William Yang Wang

In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination with the only guidance of a natural language instruction. To explore the environment and progress towards the target location, the agent must…

Computer Vision and Pattern Recognition · Computer Science · 2019-09-26 · Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-02 · Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing…

Robotics · Computer Science · 2026-05-14 · Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for…

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of…

Robotics · Computer Science · 2025-09-03 · Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates…

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action…

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end…

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding,…

Robotics · Computer Science · 2026-03-19 · Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang

Vision-and-Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with…

Robotics · Computer Science · 2026-01-27 · Zijun Li, Shijie Li, Zhenxi Zhang, Bin Li, Shoujun Zhou

Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-10 · Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin

While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-28 · Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou

Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-12 · Jialu Li, Mohit Bansal