Related papers: Unsupervised Composable Representations for Audio

Unsupervised Music Source Separation Using Differentiable Parametric Source Models

Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely…

Sound · Computer Science 2023-02-01 Kilian Schulze-Forster, Gaël Richard, Liam Kelley, Clement S. J. Doire, Roland Badeau

Compositional Audio Representation Learning

Human auditory perception is compositional in nature -- we identify auditory streams from auditory scenes with multiple sound events. However, such auditory scenes are typically represented using clip-level representations that do not…

Sound · Computer Science 2025-03-04 Sripathi Sridhar, Mark Cartwright

Unsupervised Audio Source Separation using Generative Priors

State-of-the-art under-determined audio source separation systems rely on supervised end-to-end training of carefully tailored neural network architectures operating either in the time or the spectral domain. However, these methods are…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Vivek Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, Andreas Spanias

Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach,…

Sound · Computer Science 2025-11-26 Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

Unsupervised Interpretable Representation Learning for Singing Voice Separation

In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-02 Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior

In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-19 Yochai Yemini, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya

Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. While existing reviews provide overviews, there remains limited in-depth…

Sound · Computer Science 2026-01-16 Ge Zhu, Yutong Wen, Zhiyao Duan

Controllable Music Production with Diffusion Models and Guidance Gradients

We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1 kHz stereo audio with sampling-time guidance. The scenarios we consider include…

Sound · Computer Science 2023-12-06 Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, Tom Nickson

Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a…

Sound · Computer Science 2025-11-27 Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

A Framework for Generative and Contrastive Learning of Audio Representations

In this paper, we present a framework for contrastive learning of audio representations in a self-supervised framework, without access to any ground-truth labels. The core idea in self-supervised contrastive learning is to map an audio…

Sound · Computer Science 2021-03-18 Prateek Verma, Julius Smith

High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-11 Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

Progressive distillation diffusion for raw music generation

This paper aims to apply a new deep learning approach to the task of generating raw audio files. It is based on diffusion models, a recent type of deep generative model. This new type of method has recently shown outstanding results with…

Sound · Computer Science 2023-07-21 Svetlana Pavlova

Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-30 Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

Unsupervised Single-Channel Audio Separation with Diffusion Source Priors

Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-24 Runwu Shi, Chang Li, Jiang Wang, Rui Zhang, Nabeela Khan, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-25 Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic, Vesa Välimäki

High-Quality Visually-Guided Sound Separation from Diverse Categories

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2024-10-14 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Self-Supervised Learning from Automatically Separated Sound Scenes

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…

Sound · Computer Science 2021-09-16 Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Evaluating Disentangled Representations for Controllable Music Generation

Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored.…

Sound · Computer Science 2026-02-17 Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora

Unsupervised Source Separation By Steering Pretrained Music Models

We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a…

Sound · Computer Science 2021-10-26 Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo