Related papers: Self-Supervised Audio-Visual Co-Segmentation

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao , Kristen Grauman

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman , Leonid Sigal

Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Ruohan Gao , Rogerio Feris , Kristen Grauman

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate…

Computer Vision and Pattern Recognition · Computer Science 2020-08-11 Triantafyllos Afouras , Andrew Owens , Joon Son Chung , Andrew Zisserman

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Reuben Tan , Arijit Ray , Andrea Burns , Bryan A. Plummer , Justin Salamon , Oriol Nieto , Bryan Russell , Kate Saenko

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling…

Sound · Computer Science 2020-07-29 Yoshiki Masuyama , Yoshiaki Bando , Kohei Yatabe , Yoko Sasaki , Masaki Onishi , Yasuhiro Oikawa

Semantic Grouping Network for Audio Source Separation

Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance…

Sound · Computer Science 2024-07-08 Shentong Mo , Yapeng Tian

Self-supervised object detection from audio-visual correspondence

We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Triantafyllos Afouras , Yuki M. Asano , Francois Fagan , Andrea Vedaldi , Florian Metze

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior…

Sound · Computer Science 2023-08-02 Chen Liu , Peike Li , Xingqun Qi , Hu Zhang , Lincheng Li , Dadong Wang , Xin Yu

SegSort: Segmentation by Discriminative Sorting of Segments

Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and…

Computer Vision and Pattern Recognition · Computer Science 2019-10-31 Jyh-Jing Hwang , Stella X. Yu , Jianbo Shi , Maxwell D. Collins , Tien-Ju Yang , Xiao Zhang , Liang-Chieh Chen

Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention

We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous…

Sound · Computer Science 2021-10-15 Efthymios Tzinis , Scott Wisdom , Tal Remez , John R. Hershey

Pixel Objectness: Learning to Segment Generic Objects Automatically in Images and Videos

We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions---even for object categories never…

Computer Vision and Pattern Recognition · Computer Science 2018-12-19 Bo Xiong , Suyog Dutt Jain , Kristen Grauman

Self-Supervised Learning from Automatically Separated Sound Scenes

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

Sound · Computer Science 2021-09-16 Eduardo Fonseca , Aren Jansen , Daniel P. W. Ellis , Scott Wisdom , Marco Tagliasacchi , John R. Hershey , Manoj Plakal , Shawn Hershey , R. Channing Moore , Xavier Serra

Learning to Separate Voices by Spatial Regions

We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating $4+$ sources with 2 microphones) they assume a known or fixed…

Sound · Computer Science 2022-07-18 Zhongweiyang Xu , Romit Roy Choudhury

A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation

We propose a knowledge-driven, model-based approach to segmenting audio into single-category and mixed-category chunks with applications to source separation. "Knowledge" here denotes information associated with the data, such as music…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-26 Chun-wei Ho , Sabato Marco Siniscalchi , Kai Li , Chin-Hui Lee

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Jia Li , Yapeng Tian

Audio-Visual Segmentation

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the…

Computer Vision and Pattern Recognition · Computer Science 2023-02-20 Jinxing Zhou , Jianyuan Wang , Jiayi Zhang , Weixuan Sun , Jing Zhang , Stan Birchfield , Dan Guo , Lingpeng Kong , Meng Wang , Yiran Zhong

Self-supervised Semantic Segmentation Grounded in Visual Concepts

Unsupervised semantic segmentation requires assigning a label to every pixel without any human annotations. Despite recent advances in self-supervised representation learning for individual images, unsupervised semantic segmentation with…

Computer Vision and Pattern Recognition · Computer Science 2022-07-27 Wenbin He , William Surmeier , Arvind Kumar Shekar , Liang Gou , Liu Ren

The Sound of Pixels

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.…

Computer Vision and Pattern Recognition · Computer Science 2018-10-16 Hang Zhao , Chuang Gan , Andrew Rouditchenko , Carl Vondrick , Josh McDermott , Antonio Torralba

Object Segmentation with Audio Context

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. For this project, we explore the multimodal feature aggregation for video instance segmentation task, in which we…

Computer Vision and Pattern Recognition · Computer Science 2023-01-26 Kaihui Zheng , Yuqing Ren , Zixin Shen , Tianxu Qin