Related papers: Multichannel-based learning for audio object extra…

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman

Object-Based Audio Rendering

Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus…

Sound · Computer Science 2017-08-25 Philip Jackson, Filippo Fazi, Frank Melchior, Trevor Cox, Adrian Hilton, Chris Pike, Jon Francombe, Andreas Franck, Philip Coleman, Dylan Menzies-Gow, James Woodcock, Yan Tang, Qingju Liu, Rick Hughes, Marcos Simon Galvez, Teo de Campos, Hansung Kim, Hanne Stenzel

Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Ruohan Gao, Rogerio Feris, Kristen Grauman

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate…

Computer Vision and Pattern Recognition · Computer Science 2020-08-11 Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

Segmenting Moving Objects via an Object-Centric Layered Representation

The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Junyu Xie, Weidi Xie, Andrew Zisserman

Sounding that Object: Interactive Object-Aware Image to Audio Generation

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds…

Computer Vision and Pattern Recognition · Computer Science 2025-06-05 Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang

Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate…

Computer Vision and Pattern Recognition · Computer Science 2018-11-12 Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc Duong, Patrick Pérez, Gaël Richard

DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis

We propose DeepASA, a multi-purpose model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-16 Dongheon Lee, Younghoo Kwon, Jung-Woo Choi

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

Multi-encoder attention-based architectures for sound recognition with partial visual assistance

Large-scale sound recognition data sets typically consist of acoustic recordings obtained from multimedia libraries. As a consequence, modalities other than audio can often be exploited to improve the outputs of models designed for…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-11 Wim Boes, Hugo Van hamme

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without…

Computer Vision and Pattern Recognition · Computer Science 2021-12-23 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Object Pursuit: Building a Space of Objects via Discriminative Weight Generation

We propose a framework to continuously learn object-centric representations for visual learning and understanding. Existing object-centric representations either rely on supervisions that individualize objects in the scene, or perform…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Chuanyu Pan, Yanchao Yang, Kaichun Mo, Yueqi Duan, Leonidas Guibas

Object Segmentation with Audio Context

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. For this project, we explore the multimodal feature aggregation for video instance segmentation task, in which we…

Computer Vision and Pattern Recognition · Computer Science 2023-01-26 Kaihui Zheng, Yuqing Ren, Zixin Shen, Tianxu Qin

Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments

Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming…

Sound · Computer Science 2025-11-25 Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos

Multi-Object Representation Learning with Iterative Variational Inference

Human perception is structured around objects which form the basis for our higher-level cognition and impressive systematic generalization abilities. Yet most work on representation learning focuses on feature learning without even…

Machine Learning · Computer Science 2020-07-29 Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, Alexander Lerchner

Submodular video object proposal selection for semantic object segmentation

Learning a data-driven spatio-temporal semantic representation of the objects is the key to coherent and consistent labelling in video. This paper proposes to achieve semantic video object segmentation by learning a data-driven…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Tinghuai Wang

Self-Supervised Audio-Visual Co-Segmentation

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object…

Computer Vision and Pattern Recognition · Computer Science 2019-04-22 Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

Estimating Visual Information From Audio Through Manifold Learning

We propose a new framework for extracting visual information about a scene only using audio signals. Audio-based methods can overcome some of the limitations of vision-based methods, i.e., they do not require "line-of-sight", are robust to…

Computer Vision and Pattern Recognition · Computer Science 2022-09-14 Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, Kwang Moo Yi

Self-supervised object detection from audio-visual correspondence

We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze