DiffWave: A Versatile Diffusion Model for Audio Synthesis

2021-04-01 (v3) · Audio and Speech Processing; Computation and Language; Machine Learning; Sound

Abstract

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is trained efficiently by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in speech quality (MOS: 4.44 versus 4.43) while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task, in both audio quality and sample diversity, according to various automatic and human evaluations.
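The synthesis procedure described in the abstract (white noise refined into a waveform by a Markov chain with a fixed number of steps) can be illustrated with a minimal sketch. This is a generic DDPM-style reverse loop, not the authors' implementation: the linear noise schedule, the step count, and the names `make_schedule`, `zero_eps_model`, and `reverse_sample` are all assumptions introduced here, and `zero_eps_model` is only a placeholder for a trained noise-prediction network.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an assumed choice, for illustration only)."""
    denom = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * t / denom
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a  # cumulative product of alphas up to step t
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def zero_eps_model(x, t):
    """Hypothetical stand-in for a trained network that predicts the noise in x."""
    return [0.0 for _ in x]


def reverse_sample(eps_model, length, num_steps=50, seed=0):
    """Run the reverse Markov chain from white noise down to step 0."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from white noise over the whole waveform at once.
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]

    # The number of iterations depends on num_steps, not on the waveform length.
    for t in range(num_steps - 1, -1, -1):
        eps = eps_model(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Standard deviation of the reverse transition at step t.
            sigma = math.sqrt(
                (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            )
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


waveform = reverse_sample(zero_eps_model, length=16, num_steps=50, seed=0)
```

Every sample in the waveform is updated in parallel at each step, so the cost of synthesis scales with the number of chain steps rather than with the number of audio samples, which is the contrast with autoregressive generation that the abstract draws.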

Cite

@article{arxiv.2009.09761,
  title  = {DiffWave: A Versatile Diffusion Model for Audio Synthesis},
  author = {Zhifeng Kong and Wei Ping and Jiaji Huang and Kexin Zhao and Bryan Catanzaro},
  journal= {arXiv preprint arXiv:2009.09761},
  year   = {2021}
}

Comments

ICLR 2021 (oral)
