Related papers: Compositional Audio Representation Learning

Unsupervised Composable Representations for Audio

Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning, which can be defined as the ability to generate complex structures from simpler elements. In this paper, we…

Machine Learning · Computer Science 2024-08-20 Giovanni Bindi , Philippe Esling

Evaluating Compositional Structure in Audio Representations

We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to…

Sound · Computer Science 2026-03-17 Chuyang Chen , Bea Steers , Brian McFee , Juan Bello

An efficient supervised dictionary learning method for audio signal recognition

Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features specialized to a specific audio task and that can hardly generalized to other audio fields. For example,…

Computer Vision and Pattern Recognition · Computer Science 2018-12-13 Imad Rida , Romain Hérault , Gilles Gasso

Self-Supervised Learning from Automatically Separated Sound Scenes

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

Sound · Computer Science 2021-09-16 Eduardo Fonseca , Aren Jansen , Daniel P. W. Ellis , Scott Wisdom , Marco Tagliasacchi , John R. Hershey , Manoj Plakal , Shawn Hershey , R. Channing Moore , Xavier Serra

Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on…

Sound · Computer Science 2019-11-15 Aren Jansen , Daniel P. W. Ellis , Shawn Hershey , R. Channing Moore , Manoj Plakal , Ashok C. Popat , Rif A. Saurous

Learning Self-Supervised Audio-Visual Representations for Sound Recommendations

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Sudha Krishnamurthy

Compositional Scene Representation Learning via Reconstruction: A Survey

Visual scenes are composed of visual concepts and have the property of combinatorial explosion. An important reason for humans to efficiently learn from diverse visual scenes is the ability of compositional perception, and it is desirable…

Machine Learning · Computer Science 2023-06-16 Jinyang Yuan , Tonglin Chen , Bin Li , Xiangyang Xue

Learning to Compose: Improving Object Centric Learning by Injecting Compositionality

Learning compositional representation is a key aspect of object-centric learning as it enables flexible systematic generalization and supports complex visual reasoning. However, most of the existing approaches rely on auto-encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Whie Jung , Jaehoon Yoo , Sungjin Ahn , Seunghoon Hong

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such…

Sound · Computer Science 2021-04-05 Andres Ferraro , Xavier Favory , Konstantinos Drossos , Yuntae Kim , Dmitry Bogdanov

Measuring Compositionality in Representation Learning

Many machine learning algorithms represent input data with vector embeddings or discrete codes. When inputs exhibit compositional structure (e.g. objects built from parts or procedures from subroutines), it is natural to ask whether this…

Machine Learning · Computer Science 2019-04-09 Jacob Andreas

Contrastive Environmental Sound Representation Learning

Machine hearing of the environmental sound is one of the important issues in the audio recognition domain. It gives the machine the ability to discriminate between the different input sounds that guides its decision making. In this work we…

Sound · Computer Science 2022-07-20 Peter Ochieng , Dennis Kaburu

Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

The objective of this paper is to perform audio-visual sound source separation, i.e.~to separate component audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint the source location in the input video…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Lingyu Zhu , Esa Rahtu

Data-driven audio recognition: a supervised dictionary approach

Machine hearing is an emerging area. Motivated by the need of a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. For this sake, a novel and…

Sound · Computer Science 2021-01-01 Imad Rida

Learning to Localize Sound Source in Visual Scenes

Visual events are usually accompanied by sounds in our daily lives. We pose the question: Can the machine learn the correspondence between visual scene and the sound, and localize the sound source only by observing sound and visual scene…

Computer Vision and Pattern Recognition · Computer Science 2019-02-18 Arda Senocak , Tae-Hyun Oh , Junsik Kim , Ming-Hsuan Yang , In So Kweon

A Model You Can Hear: Audio Identification with Playable Prototypes

Machine learning techniques have proved useful for classifying and analyzing audio content. However, recent methods typically rely on abstract and high-dimensional representations that are difficult to interpret. Inspired by…

Sound · Computer Science 2022-08-08 Romain Loiseau , Baptiste Bouvier , Yann Teytaut , Elliot Vincent , Mathieu Aubry , Loic Landrieu

Self-supervised Visual Reinforcement Learning with Object-centric Representations

Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky…

Machine Learning · Computer Science 2021-02-09 Andrii Zadaianchuk , Maximilian Seitzer , Georg Martius

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Arda Senocak , Tae-Hyun Oh , Junsik Kim , Ming-Hsuan Yang , In So Kweon

Learning Compositional Representations for Few-Shot Recognition

One of the key limitations of modern deep learning approaches lies in the amount of data required to train them. Humans, by contrast, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning…

Computer Vision and Pattern Recognition · Computer Science 2019-08-20 Pavel Tokmakov , Yu-Xiong Wang , Martial Hebert

Audio Language Modeling using Perceptually-Guided Discrete Representations

In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model, to…

Sound · Computer Science 2022-11-07 Felix Kreuk , Yaniv Taigman , Adam Polyak , Jade Copet , Gabriel Synnaeve , Alexandre Défossez , Yossi Adi

Learning Audio Concepts from Counterfactual Natural Language

Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language.…

Multimedia · Computer Science 2024-01-11 Ali Vosoughi , Luca Bondi , Ho-Hsiang Wu , Chenliang Xu