Related papers: Self-Supervised Audio-Visual Co-Segmentation

200 papers

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Ruohan Gao, Rogerio Feris, Kristen Grauman

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate…

Computer Vision and Pattern Recognition · Computer Science 2020-08-11 Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling…

Sound · Computer Science 2020-07-29 Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance…

Sound · Computer Science 2024-07-08 Shentong Mo, Yapeng Tian

We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior…

Sound · Computer Science 2023-08-02 Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and…

Computer Vision and Pattern Recognition · Computer Science 2019-10-31 Jyh-Jing Hwang, Stella X. Yu, Jianbo Shi, Maxwell D. Collins, Tien-Ju Yang, Xiao Zhang, Liang-Chieh Chen

We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous…

Sound · Computer Science 2021-10-15 Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions, even for object categories never…

Computer Vision and Pattern Recognition · Computer Science 2018-12-19 Bo Xiong, Suyog Dutt Jain, Kristen Grauman

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating $4+$ sources with 2 microphones) they assume a known or fixed…

Sound · Computer Science 2022-07-18 Zhongweiyang Xu, Romit Roy Choudhury

We propose a knowledge-driven, model-based approach to segmenting audio into single-category and mixed-category chunks with applications to source separation. "Knowledge" here denotes information associated with the data, such as music…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-26 Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Jia Li, Yapeng Tian

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the…

Computer Vision and Pattern Recognition · Computer Science 2023-02-20 Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

Unsupervised semantic segmentation requires assigning a label to every pixel without any human annotations. Despite recent advances in self-supervised representation learning for individual images, unsupervised semantic segmentation with…

Computer Vision and Pattern Recognition · Computer Science 2022-07-27 Wenbin He, William Surmeier, Arvind Kumar Shekar, Liang Gou, Liu Ren

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.…

Computer Vision and Pattern Recognition · Computer Science 2018-10-16 Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. For this project, we explore the multimodal feature aggregation for video instance segmentation task, in which we…

Computer Vision and Pattern Recognition · Computer Science 2023-01-26 Kaihui Zheng, Yuqing Ren, Zixin Shen, Tianxu Qin