English
Related papers

Related papers: Compositional Audio Representation Learning

200 papers

Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning, which can be defined as the ability to generate complex structures from simpler elements. In this paper, we…

Machine Learning · Computer Science 2024-08-20 Giovanni Bindi , Philippe Esling

We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to…

Sound · Computer Science 2026-03-17 Chuyang Chen , Bea Steers , Brian McFee , Juan Bello

Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features specialized to a specific audio task and that can hardly generalized to other audio fields. For example,…

Computer Vision and Pattern Recognition · Computer Science 2018-12-13 Imad Rida , Romain Hérault , Gilles Gasso

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on…

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Sudha Krishnamurthy

Visual scenes are composed of visual concepts and have the property of combinatorial explosion. An important reason for humans to efficiently learn from diverse visual scenes is the ability of compositional perception, and it is desirable…

Machine Learning · Computer Science 2023-06-16 Jinyang Yuan , Tonglin Chen , Bin Li , Xiangyang Xue

Learning compositional representation is a key aspect of object-centric learning as it enables flexible systematic generalization and supports complex visual reasoning. However, most of the existing approaches rely on auto-encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Whie Jung , Jaehoon Yoo , Sungjin Ahn , Seunghoon Hong

Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such…

Sound · Computer Science 2021-04-05 Andres Ferraro , Xavier Favory , Konstantinos Drossos , Yuntae Kim , Dmitry Bogdanov

Many machine learning algorithms represent input data with vector embeddings or discrete codes. When inputs exhibit compositional structure (e.g. objects built from parts or procedures from subroutines), it is natural to ask whether this…

Machine Learning · Computer Science 2019-04-09 Jacob Andreas

Machine hearing of the environmental sound is one of the important issues in the audio recognition domain. It gives the machine the ability to discriminate between the different input sounds that guides its decision making. In this work we…

Sound · Computer Science 2022-07-20 Peter Ochieng , Dennis Kaburu

The objective of this paper is to perform audio-visual sound source separation, i.e.~to separate component audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint the source location in the input video…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Lingyu Zhu , Esa Rahtu

Machine hearing is an emerging area. Motivated by the need of a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. For this sake, a novel and…

Sound · Computer Science 2021-01-01 Imad Rida

Visual events are usually accompanied by sounds in our daily lives. We pose the question: Can the machine learn the correspondence between visual scene and the sound, and localize the sound source only by observing sound and visual scene…

Computer Vision and Pattern Recognition · Computer Science 2019-02-18 Arda Senocak , Tae-Hyun Oh , Junsik Kim , Ming-Hsuan Yang , In So Kweon

Machine learning techniques have proved useful for classifying and analyzing audio content. However, recent methods typically rely on abstract and high-dimensional representations that are difficult to interpret. Inspired by…

Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky…

Machine Learning · Computer Science 2021-02-09 Andrii Zadaianchuk , Maximilian Seitzer , Georg Martius

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Arda Senocak , Tae-Hyun Oh , Junsik Kim , Ming-Hsuan Yang , In So Kweon

One of the key limitations of modern deep learning approaches lies in the amount of data required to train them. Humans, by contrast, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning…

Computer Vision and Pattern Recognition · Computer Science 2019-08-20 Pavel Tokmakov , Yu-Xiong Wang , Martial Hebert

In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model, to…

Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language.…

Multimedia · Computer Science 2024-01-11 Ali Vosoughi , Luca Bondi , Ho-Hsiang Wu , Chenliang Xu
‹ Prev 1 2 3 10 Next ›