Related papers: Move2Hear: Active Audio-Visual Source Separation

Active Audio-Visual Separation of Dynamic Sound Sources

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Sagnik Majumder , Kristen Grauman

Learning to Set Waypoints for Audio-Visual Navigation

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-02-12 Changan Chen , Sagnik Majumder , Ziad Al-Halah , Ruohan Gao , Santhosh Kumar Ramakrishnan , Kristen Grauman

Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

The objective of this paper is to perform audio-visual sound source separation, i.e.~to separate component audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint the source location in the input video…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Lingyu Zhu , Esa Rahtu

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao , Kristen Grauman

Sound Adversarial Audio-Visual Navigation

Audio-visual navigation task requires an agent to find a sound source in a realistic, unmapped 3D environment by utilizing egocentric audio-visual observations. Existing audio-visual navigation works assume a clean environment that solely…

Sound · Computer Science 2022-02-23 Yinfeng Yu , Wenbing Huang , Fuchun Sun , Changan Chen , Yikai Wang , Xiaohong Liu

Learning to Separate Voices by Spatial Regions

We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating $4+$ sources with 2 microphones) they assume a known or fixed…

Sound · Computer Science 2022-07-18 Zhongweiyang Xu , Romit Roy Choudhury

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers. Although audio-only approaches achieve satisfactory performance, they build on a strategy to handle the predefined conditions,…

Sound · Computer Science 2020-12-01 Peng Zhang , Jiaming Xu , Jing shi , Yunzhe Hao , Bo Xu

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Chuang Gan , Yiwei Zhang , Jiajun Wu , Boqing Gong , Joshua B. Tenenbaum

SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Changan Chen , Unnat Jain , Carl Schissler , Sebastia Vicenc Amengual Gari , Ziad Al-Halah , Vamsi Krishna Ithapu , Philip Robinson , Kristen Grauman

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within, in the case where the only available information is the raw…

Sound · Computer Science 2021-11-30 Petros Giannakopoulos , Aggelos Pikrakis , Yannis Cotronis

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate…

Computer Vision and Pattern Recognition · Computer Science 2018-02-13 Aviv Gabbay , Ariel Ephrat , Tavi Halperin , Shmuel Peleg

Efficient Area-based and Speaker-Agnostic Source Separation

This paper introduces an area-based source separation method designed for virtual meeting scenarios. The aim is to preserve speech signals from an unspecified number of sources within a defined spatial area in front of a linear microphone…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-20 Martin Strauss , Okan Köpüklü

Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds

Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input to detect and find the goal, they focus on clean and…

Sound · Computer Science 2023-01-04 Abdelrahman Younes , Daniel Honerkamp , Tim Welschehold , Abhinav Valada

Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Ruohan Gao , Rogerio Feris , Kristen Grauman

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-12 Otavio Braga , Olivier Siohan

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-16 Daniel Michelsanti , Zheng-Hua Tan , Shi-Xiong Zhang , Yong Xu , Meng Yu , Dong Yu , Jesper Jensen

Look\&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. According to their respective characteristics, the scheme of independently designed architecture has been…

Sound · Computer Science 2022-07-08 Junwen Xiong , Yu Zhou , Peng Zhang , Lei Xie , Wei Huang , Yufei Zha

A Deep Reinforcement Learning Approach to Audio-Based Navigation in a Multi-Speaker Environment

In this work we use deep reinforcement learning to create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment, a problem that has received very little attention…

Sound · Computer Science 2021-05-17 Petros Giannakopoulos , Aggelos Pikrakis , Yannis Cotronis

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-13 Otavio Braga , Olivier Siohan

Self-Supervised Audio-Visual Co-Segmentation

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object…

Computer Vision and Pattern Recognition · Computer Science 2019-04-22 Andrew Rouditchenko , Hang Zhao , Chuang Gan , Josh McDermott , Antonio Torralba