Related papers: Multi-scale Multi-instance Visual Sound Localizati…

Localizing Visual Sounds the Easy Way

Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive…

Computer Vision and Pattern Recognition · Computer Science 2022-03-30 Shentong Mo, Pedro Morgado

Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual…

Multimedia · Computer Science 2024-09-12 Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin

Multiple Sound Sources Localization from Coarse to Fine

How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are lacking. To solve this problem, we develop a two-stage audiovisual learning framework…

Computer Vision and Pattern Recognition · Computer Science 2020-07-15 Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

How to Listen? Rethinking Visual Sound Localization

Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and…

Sound · Computer Science 2022-04-12 Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in…

Computer Vision and Pattern Recognition · Computer Science 2025-06-25 Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Tanvir Mahmud, Yapeng Tian, Diana Marculescu

Audio-Visual Grouping Network for Sound Localization from Mixtures

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shentong Mo, Yapeng Tian

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without…

Computer Vision and Pattern Recognition · Computer Science 2021-12-23 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior…

Sound · Computer Science 2023-08-02 Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio

The task of Visual Sound Source Localization (VSSL) involves identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advancements in state-of-the-art (SOTA) models,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Xavier Juanola, Gloria Haro, Magdalena Fuentes

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their…

Computer Vision and Pattern Recognition · Computer Science 2024-04-04 Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Learning from Silence and Noise for Visual Sound Source Localization

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro

Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization

The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question. Existing methods either focus solely on visual modality or integrate visual and subtitle modalities.…

Multimedia · Computer Science 2024-11-06 Zhibin Wen, Bin Li

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, and localize the sound source, only by observing them as humans do? To investigate its…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Jia Li, Yapeng Tian

Sound Source Localization is All about Cross-Modal Alignment

Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective.…

Computer Vision and Pattern Recognition · Computer Science 2023-09-20 Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

Localizing Visual Sounds the Hard Way

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

3D Audio-Visual Segmentation

Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Artem Sokolov, Swapnil Bhosale, Xiatian Zhu

AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment

Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methodologies typically require spatially-paired audio-visual data and cannot selectively localize…

Sound · Computer Science 2025-08-07 Yu Chen, Hongxu Zhu, Jiadong Wang, Kainan Chen, Xinyuan Qian