Diffusion-based Frameworks for Unsupervised Speech Enhancement

Sound 2026-05-26 v4

Abstract

This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new unsupervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling consistently improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Code, demo, and supplementary materials are publicly available.
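
To make the NMF-structured Gaussian noise model more concrete, the following is a minimal sketch of how a noise variance of the form W @ H could be fitted to a noise power spectrogram. It is not the authors' code: the abstract does not specify the parameter update, so the Itakura-Saito multiplicative updates used here are an assumption (a standard choice for this kind of model), the diffusion-based E-step is replaced by a synthetic noise estimate, and all names (`fit_nmf_noise_variance`, `rank`, `n_iter`) are hypothetical.

```python
import numpy as np


def is_divergence(P, V):
    """Itakura-Saito divergence between power spectrogram P and model variance V."""
    R = P / V
    return float(np.sum(R - np.log(R) - 1.0))


def fit_nmf_noise_variance(P, rank=4, n_iter=50, eps=1e-10, seed=0):
    """Fit V = W @ H to a noise power spectrogram P of shape (F, T).

    Assumption: majorization-minimization multiplicative updates for the
    Itakura-Saito divergence (exponent 0.5), standing in for the M-step.
    """
    rng = np.random.default_rng(seed)
    F, T = P.shape
    W = rng.random((F, rank)) + eps  # spectral templates, non-negative
    H = rng.random((rank, T)) + eps  # temporal activations, non-negative
    losses = []
    for _ in range(n_iter):
        V = W @ H + eps
        W *= (((P / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)) ** 0.5
        V = W @ H + eps
        H *= ((W.T @ (P / V**2)) / (W.T @ (1.0 / V) + eps)) ** 0.5
        losses.append(is_divergence(P, W @ H + eps))
    return W, H, losses


# Synthetic stand-in for the noise estimate an E-step would provide:
# complex Gaussian noise whose variance follows a rank-3 NMF model.
rng = np.random.default_rng(1)
true_var = (rng.random((16, 3)) + 0.1) @ (rng.random((3, 40)) + 0.1)
noise = np.sqrt(true_var / 2.0) * (
    rng.standard_normal((16, 40)) + 1j * rng.standard_normal((16, 40))
)
P = np.abs(noise) ** 2 + 1e-10  # noise power spectrogram, kept strictly positive

W, H, losses = fit_nmf_noise_variance(P, rank=4, n_iter=50)
print(f"IS divergence: first iteration {losses[0]:.2f}, last iteration {losses[-1]:.2f}")
```

In an EM loop of the kind the abstract describes, `P` would instead be computed from the current noise estimate (for example, the noisy spectrogram minus the sampled speech, or the sampled noise itself in the explicit-noise variant), and the fitted `W @ H` would serve as the noise variance in the next E-step.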

Cite

@article{arxiv.2601.09931,
  title  = {Diffusion-based Frameworks for Unsupervised Speech Enhancement},
  author = {Jean-Eudes Ayilo and Mostafa Sadeghi and Romain Serizel and Xavier Alameda-Pineda},
  journal= {arXiv preprint arXiv:2601.09931},
  year   = {2026}
}