Related papers: Visually Guided Sound Source Separation and Locali…

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Audio-Visual Separation with Hierarchical Fusion and Representation Alignment

Self-supervised audio-visual source separation leverages natural correlations between audio and vision modalities to separate mixed audio signals. In this work, we first systematically analyse the performance of existing multimodal fusion…

Multimedia · Computer Science 2025-10-10 Han Hu, Dongheng Lin, Qiming Huang, Yuqi Hou, Hyung Jin Chang, Jianbo Jiao

Learning to Localize Sound Source in Visual Scenes

Visual events are usually accompanied by sounds in our daily lives. We pose the question: Can the machine learn the correspondence between visual scene and the sound, and localize the sound source only by observing sound and visual scene…

Computer Vision and Pattern Recognition · Computer Science 2019-02-18 Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

Move2Hear: Active Audio-Visual Source Separation

We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources…

Computer Vision and Pattern Recognition · Computer Science 2021-08-27 Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

Leveraging Category Information for Single-Frame Visual Sound Source Separation

Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but with the expense of large multi-stage architectures and…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Lingyu Zhu, Esa Rahtu

Self-Supervised Audio-Visual Co-Segmentation

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object…

Computer Vision and Pattern Recognition · Computer Science 2019-04-22 Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

FlowGrad: Using Motion for Visual Sound Source Localization

Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for…

Sound · Computer Science 2023-04-18 Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor involved visual…

Sound · Computer Science 2023-06-21 Zengjie Song, Zhaoxiang Zhang

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate…

Computer Vision and Pattern Recognition · Computer Science 2020-08-11 Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection…

Sound · Computer Science 2022-11-01 Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

Visual Scene Graphs for Audio Source Separation

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid…

Computer Vision and Pattern Recognition · Computer Science 2021-09-27 Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

Self-Supervised Learning from Automatically Separated Sound Scenes

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

Sound · Computer Science 2021-09-16 Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Visually Guided Self Supervised Learning of Speech Representations

Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-21 Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

Leveraging Sound Source Trajectories for Universal Sound Separation

Existing methods utilizing spatial information for sound source separation require prior knowledge of the direction of arrival (DOA) of the source or utilize estimated but imprecise localization results, which impairs the separation…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-08 Donghang Wu, Xihong Wu, Tianshu Qu

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and…

Computer Vision and Pattern Recognition · Computer Science 2018-10-10 Andrew Owens, Alexei A. Efros

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling…

Sound · Computer Science 2020-07-29 Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Self-supervised sound source localization is usually challenged by the modality inconsistency. In recent studies, contrastive learning based strategies have shown promising to establish such a consistent correspondence between audio and…

Computer Vision and Pattern Recognition · Computer Science 2023-08-10 Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang