Conditional Diffusion Model for Target Speaker Extraction

Audio and Speech Processing 2023-10-10 v1 Machine Learning Sound

Abstract

We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We utilise ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a specific speaker further improves performance, enabling personalisation in target speaker extraction.
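The forward process described above can be illustrated for a single complex STFT coefficient. The abstract does not state the exact SDE, so this minimal sketch assumes an Ornstein-Uhlenbeck-type drift towards the mixture, dx = gamma * (y - x) dt + g dw, with a constant diffusion coefficient; the names `GAMMA`, `G`, `forward_mean`, `forward_std` and `sample_xt` and all numeric values are hypothetical and are not taken from the paper.

```python
import math
import random

GAMMA = 1.5  # hypothetical drift stiffness (assumption, not from the paper)
G = 0.5      # hypothetical constant diffusion coefficient (assumption)


def forward_mean(x0, y, t, gamma=GAMMA):
    """Mean of x_t for dx = gamma * (y - x) dt + g dw, started at x0.

    The mean interpolates from the target coefficient x0 (t = 0)
    towards the mixture coefficient y (large t).
    """
    return y + (x0 - y) * math.exp(-gamma * t)


def forward_std(t, gamma=GAMMA, g=G):
    """Standard deviation of x_t for the same SDE with constant g."""
    return g * math.sqrt((1.0 - math.exp(-2.0 * gamma * t)) / (2.0 * gamma))


def sample_xt(x0, y, t, rng, gamma=GAMMA, g=G):
    """Draw x_t from the Gaussian perturbation kernel of the assumed SDE."""
    # Unit-variance complex Gaussian: real and imaginary parts have variance 1/2.
    noise = complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)
    return forward_mean(x0, y, t, gamma) + forward_std(t, gamma, g) * noise


# Toy values standing in for one time-frequency bin of each signal.
rng = random.Random(0)
target = complex(0.8, -0.3)
interferer = complex(-0.2, 0.5)
mixture = target + interferer

# A noisy state part-way between the target and the mixture.
x_mid = sample_xt(target, mixture, 0.5, rng)
```

Under these assumptions the process starts exactly at the target coefficient and its distribution approaches a Gaussian centred on the mixture coefficient, which is the behaviour the abstract describes. A reverse-time sampler would additionally need the learned, speaker-conditioned score function, which is not sketched here.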

Cite

@article{arxiv.2310.04791,
  title  = {Conditional Diffusion Model for Target Speaker Extraction},
  author = {Theodor Nguyen and Guangzhi Sun and Xianrui Zheng and Chao Zhang and Philip C Woodland},
  journal= {arXiv preprint arXiv:2310.04791},
  year   = {2023}
}

Comments

5 pages, 4 figures, submitted to ICASSP 2024
