Related papers: EDSep: An Effective Diffusion-Based Method for Spe…

Diffusion-based Generative Speech Source Separation

We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-03 Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, Min-Seok Choi

Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

While diffusion models are best known for their performance in generative tasks, they have also been successfully applied to many other tasks, including audio source separation. However, current generative approaches to music source…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-24 Yun-Ning Hung, Richard Vogl, Filip Korzeniowski, Igor Pereira

ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning

Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into…

Sound · Computer Science 2025-10-08 Tao Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-14 Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems typically rely on synthetic data pipelines, which may not reflect…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-30 Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

Noise-robust Speech Separation with Fast Generative Correction

Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-12 Helin Wang, Jesus Villalba, Laureano Moro-Velazquez, Jiarui Hai, Thomas Thebaud, Najim Dehak

Single and Few-step Diffusion for Generative Speech Enhancement

Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-17 Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

Diffusion Normalizing Flow

We present a novel generative modeling method called diffusion normalizing flow based on stochastic differential equations (SDEs). The algorithm consists of two neural SDEs: a forward SDE that gradually adds noise to the data to transform…

Machine Learning · Computer Science 2021-10-15 Qinsheng Zhang, Yongxin Chen

DDTSE: Discriminative Diffusion Model for Target Speech Extraction

Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, research on target speech extraction under multi-speaker noisy conditions remains relatively…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-08 Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian

Conditional Diffusion Model for Target Speaker Extraction

We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland

GDiffuSE: Diffusion-based speech enhancement with noise model guidance

This paper introduces a novel speech enhancement (SE) approach based on a denoising diffusion probabilistic model (DDPM), termed Guided diffusion for speech enhancement (GDiffuSE). In contrast to conventional methods that directly map noisy…

Sound · Computer Science 2026-03-03 Efrayim Yanir, David Burshtein, Sharon Gannot

DiffSED: Sound Event Detection with Denoising Diffusion

Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more…

Sound · Computer Science 2023-08-21 Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

Diffusion models have emerged as a dominant framework for generative modeling, but their mathematical foundations are often presented separately through diffusion probabilistic models, score-based modeling, stochastic differential…

Machine Learning · Computer Science 2026-05-29 Jiayi Fu, Yuxia Wang

SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers

Diffusion model, a new generative modelling paradigm, has achieved great success in image, audio, and video generation. However, considering the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to…

Computation and Language · Computer Science 2023-05-23 Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, Songfang Huang

Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach,…

Sound · Computer Science 2025-11-26 Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

SEED: Speaker Embedding Enhancement Diffusion Model

A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-23 KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung

Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation

Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn…

Machine Learning · Computer Science 2025-05-22 Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou

Improving Voice Separation by Incorporating End-to-end Speech Recognition

Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data. In this work, we propose to explicitly incorporate the phonetic and linguistic…

Sound · Computer Science 2020-05-05 Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

Empowering Diffusion Models on the Embedding Space for Text Generation

Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies of the…

Computation and Language · Computer Science 2024-04-23 Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, Linli Xu

MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow

Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-23 Riki Shimizu, Xilin Jiang, Nima Mesgarani