GenVC: Self-Supervised Zero-Shot Voice Conversion

Zexin Cai; Henry Li Xinyuan; Ashi Garg; Leibny Paola García-Perera; Kevin Duh; Sanjeev Khudanpur; Matthew Wiesner; Nicholas Andrews

Audio and Speech Processing; Machine Learning; 2025-08-21 (v2)

Abstract

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
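
As a rough illustration of the autoregressive formulation mentioned above, the toy sketch below shows why the output length need not match the source length: tokens are emitted one at a time until an end-of-sequence token appears. Everything in it is hypothetical (the `generate` loop, the `toy_next_token` stand-in, the token ids, and the `EOS` value); it stands in for the Transformer language model and speech tokenizers and is not GenVC's implementation.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(
    next_token: Callable[[List[int], List[int], List[int]], int],
    speaker_prompt: List[int],
    content_tokens: List[int],
    max_len: int = 50,
) -> List[int]:
    """Greedy autoregressive decoding: emit one token at a time until EOS."""
    output: List[int] = []
    for _ in range(max_len):
        token = next_token(speaker_prompt, content_tokens, output)
        if token == EOS:
            break
        output.append(token)
    return output


def toy_next_token(
    speaker_prompt: List[int], content_tokens: List[int], output: List[int]
) -> int:
    """Hypothetical stand-in for a trained model (an assumption, not GenVC).

    Repeats each content token len(speaker_prompt) times, a crude proxy for
    target-dependent timing, then emits EOS.
    """
    repeat = len(speaker_prompt)
    step = len(output)
    if step >= repeat * len(content_tokens):
        return EOS
    # Shift into a separate, made-up "acoustic token" id range.
    return content_tokens[step // repeat] + 100


source_content = [1, 2, 3]
out_a = generate(toy_next_token, speaker_prompt=[7], content_tokens=source_content)
out_b = generate(toy_next_token, speaker_prompt=[7, 8], content_tokens=source_content)
print(len(out_a), len(out_b))
```

The same three content tokens yield outputs of different lengths depending on the prompt, which is the sense in which an autoregressive decoder is not locked to the source's frame-by-frame timing.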

Keywords

speech processing; generative adversarial networks for speech; speaker recognition and verification

Cite

@article{arxiv.2502.04519,
  title  = {GenVC: Self-Supervised Zero-Shot Voice Conversion},
  author = {Zexin Cai and Henry Li Xinyuan and Ashi Garg and Leibny Paola García-Perera and Kevin Duh and Sanjeev Khudanpur and Matthew Wiesner and Nicholas Andrews},
  journal= {arXiv preprint arXiv:2502.04519},
  year   = {2025}
}

Comments

Accepted by IEEE ASRU 2025.
