Related papers: Conditional Diffusion Model for Target Speaker Ext…

Diffusion-based Generative Speech Source Separation

We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-03 Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, Min-Seok Choi

Target Speech Extraction with Conditional Diffusion Model

Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-21 Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani

EDSep: An Effective Diffusion-Based Method for Speech Source Separation

Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in generation tasks, their adaptation to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-28 Jinwei Dong, Xinsheng Wang, Qirong Mao

Informed Source Extraction With Application to Acoustic Echo Reduction

Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-17 Mohamed Elminshawi, Wolfgang Mack, Emanuël A. P. Habets

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-14 Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

SpEx: Multi-Scale Time Domain Speaker Extraction Network

Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-20 Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

DDTSE: Discriminative Diffusion Model for Target Speech Extraction

Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, research on target speech extraction under multi-speaker noisy conditions remains relatively…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-08 Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian

Unsupervised speech enhancement with diffusion-based generative models

Recently, conditional score-based diffusion models have gained significant attention in the field of supervised speech enhancement, yielding state-of-the-art performance. However, these methods may face challenges when generalising to…

Computer Vision and Pattern Recognition · Computer Science 2023-09-20 Berné Nortier, Mostafa Sadeghi, Romain Serizel

SpEx+: A Complete Time Domain Speaker Extraction Network

Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-19 Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

L-SpEx: Localized Target Speaker Extraction

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-22 Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Variance-Reduced Diffusion Sampling via Target Score Identity

We study variance reduction for score estimation and diffusion-based sampling in settings where the clean (target) score is available or can be approximated. Starting from the Target Score Identity (TSI), which expresses the noisy marginal…

Machine Learning · Statistics 2026-01-26 Alois Duston, Tan Bui-Thanh

Audio-Visual Speech Enhancement with Score-Based Generative Models

This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-05 Julius Richter, Simone Frintrop, Timo Gerkmann

Enhancing Target Speaker Extraction with Explicit Speaker Consistency Modeling

Target Speaker Extraction (TSE) uses a reference cue to extract the target speech from a mixture. In TSE systems relying on audio cues, the speaker embedding from the enrolled speech is crucial to performance. However, these embeddings may…

Sound · Computer Science 2025-08-12 Shu Wu, Anbin Qi, Yanzhang Xie, Xiang Xie

Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-30 Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a…

Sound · Computer Science 2025-11-27 Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

SEED: Speaker Embedding Enhancement Diffusion Model

A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-23 KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-23 Yudong Yang, Zhan Liu, Wenyi Yu, Guangzhi Sun, Qiuqiang Kong, Chao Zhang

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-08 Simon Welker, Julius Richter, Timo Gerkmann

Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. While existing reviews provide overviews, there remains limited in-depth…

Sound · Computer Science 2026-01-16 Ge Zhu, Yutong Wen, Zhiyao Duan

Score-based Continuous-time Discrete Diffusion Models

Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e.,…

Machine Learning · Computer Science 2023-03-07 Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, Hanjun Dai