Related papers: Dual Normalization Multitasking for Audio-Visual S…

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling…

Sound · Computer Science 2020-07-29 Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

Multi-scale Multi-instance Visual Sound Localization and Segmentation

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Shentong Mo, Haofan Wang

Audio-Visual Grouping Network for Sound Localization from Mixtures

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shentong Mo, Yapeng Tian

Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without…

Computer Vision and Pattern Recognition · Computer Science 2024-03-06 Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng

A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio

The task of Visual Sound Source Localization (VSSL) involves identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advancements in state-of-the-art (SOTA) models,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Xavier Juanola, Gloria Haro, Magdalena Fuentes

Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual…

Multimedia · Computer Science 2024-09-12 Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin

Multi-goal Audio-visual Navigation using Sound Direction Map

Over the past few years, there has been a great deal of research on navigation tasks in indoor environments using deep reinforcement learning agents. Most of these tasks use only visual information in the form of first-person images to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-02 Haru Kondoh, Asako Kanezaki

Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Visual and audio events simultaneously occur and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider vision modality. In this paper, we propose a multitask learning method…

Computer Vision and Pattern Recognition · Computer Science 2021-11-17 Minglang Qiao, Yufan Liu, Mai Xu, Xin Deng, Bing Li, Weiming Hu, Ali Borji

DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual…

Computer Vision and Pattern Recognition · Computer Science 2025-12-24 Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee, Xulong Bai, Tianrui Zhu, Jingxuan Niu, Yansong Tang

Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization

The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question. Existing methods either focus solely on visual modality or integrate visual and subtitle modalities.…

Multimedia · Computer Science 2024-11-06 Zhibin Wen, Bin Li

A Closer Look at Weakly-Supervised Audio-Visual Source Localization

Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised…

Sound · Computer Science 2022-09-21 Shentong Mo, Pedro Morgado

AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment

Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methodologies typically require spatially-paired audio-visual data and cannot selectively localize…

Sound · Computer Science 2025-08-07 Yu Chen, Hongxu Zhu, Jiadong Wang, Kainan Chen, Xinyuan Qian

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in…

Computer Vision and Pattern Recognition · Computer Science 2025-06-25 Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Most of the prior studies in the spatial direction-of-arrival (DoA) domain focus on a single modality. However, humans use auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio…

Sound · Computer Science 2021-05-14 Xinyuan Qian, Maulik Madhavi, Zexu Pan, Jiadong Wang, Haizhou Li

Learning from Silence and Noise for Visual Sound Source Localization

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman

AcousticFusion: Fusing Sound Source Localization to Visual SLAM in Dynamic Environments

Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply some…

Robotics · Computer Science 2021-08-04 Tianwei Zhang, Huayan Zhang, Xiaofei Li, Junfeng Chen, Tin Lun Lam, Sethu Vijayakumar

Localizing Visual Sounds the Easy Way

Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive…

Computer Vision and Pattern Recognition · Computer Science 2022-03-30 Shentong Mo, Pedro Morgado

Objects that Sound

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Relja Arandjelović, Andrew Zisserman

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool