Related papers: Semantic Audio-Visual Navigation

Semantic Audio-Visual Navigation in Continuous Environments

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang

Learning to Set Waypoints for Audio-Visual Navigation

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-02-12 Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman

SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments

Recent work on audio-visual navigation targets a single static sound in noise-free audio environments and struggles to generalize to unheard sounds. We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI…

Computer Vision and Pattern Recognition · Computer Science 2022-01-13 Abdelrahman Younes

Sound Adversarial Audio-Visual Navigation

The audio-visual navigation task requires an agent to find a sound source in a realistic, unmapped 3D environment by utilizing egocentric audio-visual observations. Existing audio-visual navigation works assume a clean environment that solely…

Sound · Computer Science 2022-02-23 Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, Xiaohong Liu

Visual Acoustic Matching

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds

Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input to detect and find the goal, they focus on clean and…

Sound · Computer Science 2023-01-04 Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, Abhinav Valada

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

Audio Spatially-Guided Fusion for Audio-Visual Navigation

Audio-visual Navigation refers to an agent utilizing visual and auditory information in complex 3D environments to accomplish target localization and path planning, thereby achieving autonomous navigation. The core challenge of this task…

Sound · Computer Science 2026-04-06 Xinyu Zhou, Yinfeng Yu

Learning to Map for Active Semantic Goal Navigation

We consider the problem of object goal navigation in unseen environments. Solving this problem requires learning of contextual semantic priors, a challenging endeavour given the spatial and semantic variability of indoor environments.…

Computer Vision and Pattern Recognition · Computer Science 2022-03-10 Georgios Georgakis, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Kostas Daniilidis

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

Objects that Sound

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Relja Arandjelović, Andrew Zisserman

Active Audio-Visual Separation of Dynamic Sound Sources

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Sagnik Majumder, Kristen Grauman

Multi-goal Audio-visual Navigation using Sound Direction Map

Over the past few years, there has been a great deal of research on navigation tasks in indoor environments using deep reinforcement learning agents. Most of these tasks use only visual information in the form of first-person images to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-02 Haru Kondoh, Asako Kanezaki

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

Visual-audio navigation (VAN) is attracting more and more attention from the robotic community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to…

Robotics · Computer Science 2023-06-22 Hongcheng Wang, Yuxuan Wang, Fangwei Zhong, Mingdong Wu, Jianwei Zhang, Yizhou Wang, Hao Dong

Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene…

Sound · Computer Science 2022-03-01 Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool

Audio-Guided Visual Perception for Audio-Visual Navigation

Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor…

Sound · Computer Science 2025-10-15 Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng

Towards Generalisable Audio Representations for Audio-Visual Navigation

In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments based on its audio and visual perceptions. While existing methods attempt to improve the navigation…

Sound · Computer Science 2022-06-02 Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai

Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables

Imagine being able to listen to the birds chirping in a park without hearing the chatter from other hikers, or being able to block out traffic noise on a busy street while still being able to hear emergency sirens and car honks. We…

Sound · Computer Science 2023-11-02 Bandhav Veluri, Malek Itani, Justin Chan, Takuya Yoshioka, Shyamnath Gollakota

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu