Efficient Parallel Audio Generation using Group Masked Language Modeling

Audio and Speech Processing 2024-10-28 v1 Artificial Intelligence Machine Learning

Abstract

We present a fast, high-quality codec language model for parallel audio generation. Although SoundStorm, a state-of-the-art parallel audio generation model, is faster at inference than autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Together, the training and sampling schemes enable the model to synthesize high-quality audio in a small number of iterations by effectively modeling group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.

Cite

@article{arxiv.2401.01099,
  title  = {Efficient Parallel Audio Generation using Group Masked Language Modeling},
  author = {Myeonghun Jeong and Minchan Kim and Joun Yeop Lee and Nam Soo Kim},
  journal= {arXiv preprint arXiv:2401.01099},
  year   = {2024}
}

Comments

This work has been submitted to the IEEE for possible publication.
