Related papers: Learning a Visually Grounded Memory Assistant

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant.…

Computer Vision and Pattern Recognition · Computer Science 2018-04-09 Peter Anderson , Qi Wu , Damien Teney , Jake Bruce , Mark Johnson , Niko Sünderhauf , Ian Reid , Stephen Gould , Anton van den Hengel

A Grounded Memory System For Smart Personal Assistants

A wide variety of agentic AI applications - ranging from cognitive assistants for dementia patients to robotics - demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three…

Artificial Intelligence · Computer Science 2025-05-13 Felix Ocker , Jörg Deigmöller , Pavel Smirnov , Julian Eggert

Towards self-attention based visual navigation in the real world

Vision guided navigation requires processing complex visual information to inform task-orientated decisions. Applications include autonomous robots, self-driving cars, and assistive vision for humans. A key element is the extraction and…

Robotics · Computer Science 2022-09-20 Jaime Ruiz-Serra , Jack White , Stephen Petrie , Tatiana Kameneva , Chris McCarthy

Multimodal Large Language Model for Visual Navigation

Recent efforts to enable visual navigation using large language models have mainly focused on developing complex prompt systems. These systems incorporate instructions, observations, and history into massive text prompts, which are then…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Yao-Hung Hubert Tsai , Vansh Dhar , Jialu Li , Bowen Zhang , Jian Zhang

Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People

Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments often unknown a priori.…

Robotics · Computer Science 2026-01-29 Masaki Kuribayashi , Kohei Uehara , Allan Wang , Daisuke Sato , Simon Chu , Shigeo Morishima

Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

Humans navigate unfamiliar environments using episodic simulation and episodic memory, which facilitate a deeper understanding of the complex relationships between environments and objects. Developing an imaginative memory system inspired…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Yiyuan Pan , Yunzhe Xu , Zhe Liu , Hesheng Wang

SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Changan Chen , Unnat Jain , Carl Schissler , Sebastia Vicenc Amengual Gari , Ziad Al-Halah , Vamsi Krishna Ithapu , Philip Robinson , Kristen Grauman

Embodied Visual Active Learning for Semantic Segmentation

We study the task of embodied visual active learning, where an agent is set to explore a 3D environment with the goal of acquiring visual scene understanding by actively selecting views for which to request annotation. While accurate on some…

Computer Vision and Pattern Recognition · Computer Science 2020-12-18 David Nilsson , Aleksis Pirinen , Erik Gärtner , Cristian Sminchisescu

Embodied Learning for Lifelong Visual Perception

We study lifelong visual perception in an embodied setup, where we develop new models and compare various agents that navigate in buildings and occasionally request annotations which, in turn, are used to refine their visual perception…

Computer Vision and Pattern Recognition · Computer Science 2021-12-30 David Nilsson , Aleksis Pirinen , Erik Gärtner , Cristian Sminchisescu

Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning

Mobile agents that can leverage help from humans can potentially accomplish more complex tasks than they could entirely on their own. We develop "Help, Anna!" (HANNA), an interactive photo-realistic simulator in which an agent fulfills…

Human-Computer Interaction · Computer Science 2019-11-25 Khanh Nguyen , Hal Daumé

Crowdsourcing the Perception of Machine Teaching

Teachable interfaces can empower end-users to attune machine learning systems to their idiosyncratic characteristics and environment by explicitly providing pertinent training examples. While facilitating control, their effectiveness can be…

Human-Computer Interaction · Computer Science 2020-02-07 Jonggi Hong , Kyungjun Lee , June Xu , Hernisa Kacorri

MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

Visual navigation in unknown environments based solely on natural language descriptions is a key capability for intelligent robots. In this work, we propose a navigation framework built upon off-the-shelf Visual Language Models (VLMs),…

Robotics · Computer Science 2025-08-08 Weifan Zhang , Tingguang Li , Yuzhen Liu

Situational Fusion of Visual Representation for Visual Navigation

A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities. For example, to "go to the nearest chair", the agent might need to identify a chair in a living room using…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Bokui Shen , Danfei Xu , Yuke Zhu , Leonidas J. Guibas , Li Fei-Fei , Silvio Savarese

Learning to Set Waypoints for Audio-Visual Navigation

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-02-12 Changan Chen , Sagnik Majumder , Ziad Al-Halah , Ruohan Gao , Santhosh Kumar Ramakrishnan , Kristen Grauman

Collecting Interactive Multi-modal Datasets for Grounded Language Understanding

Human intelligence can adapt remarkably quickly to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided…

Computation and Language · Computer Science 2023-03-22 Shrestha Mohanty , Negar Arabzadeh , Milagro Teruel , Yuxuan Sun , Artem Zholus , Alexey Skrynnik , Mikhail Burtsev , Kavya Srinet , Aleksandr Panov , Arthur Szlam , Marc-Alexandre Côté , Julia Kiseleva

Semantic Audio-Visual Navigation

Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 Changan Chen , Ziad Al-Halah , Kristen Grauman

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel…

Human-Computer Interaction · Computer Science 2024-10-08 Chengyuan Xu , Radha Kumaran , Noah Stier , Kangyou Yu , Tobias Höllerer

Embodied Navigation at the Art Gallery

Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Roberto Bigazzi , Federico Landi , Silvia Cascianelli , Marcella Cornia , Lorenzo Baraldi , Rita Cucchiara

MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation

Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this…

Computer Vision and Pattern Recognition · Computer Science 2021-03-23 Zachary Seymour , Kowshik Thopalli , Niluthpol Mithun , Han-Pang Chiu , Supun Samarasekera , Rakesh Kumar

Personalized Large Language Model Assistant with Evolving Conditional Memory

With the rapid development of large language models, AI assistants like ChatGPT have become increasingly integrated into people's work and lives but are limited in personalized services. In this paper, we present a plug-and-play framework…

Computation and Language · Computer Science 2024-10-15 Ruifeng Yuan , Shichao Sun , Yongqi Li , Zili Wang , Ziqiang Cao , Wenjie Li