Related papers: Zero-shot Voice Conversion with Diffusion Transfor…

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to…

Sound · Computer Science · 2024-12-04 · Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-03 · Jialong Zuo, Shengpeng Ji, Minghui Fang, Mingze Li, Ziyue Jiang, Xize Cheng, Xiaoda Yang, Chen Feiyang, Xinyu Duan, Zhou Zhao

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the…

Sound · Computer Science · 2024-01-31 · Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech

Zero-shot voice conversion aims to transfer the voice of a source speaker to that of a speaker unseen during training, while preserving the content information. Although various methods have been proposed to reconstruct speaker information…

Sound · Computer Science · 2024-08-22 · Anastasia Avdeeva, Aleksei Gusev

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using language model-based or…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-12-11 · Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie

Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress…

Sound · Computer Science · 2025-01-13 · Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment

Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage, where source speaker traits persist, remains a challenge, especially in neural…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-07-15 · Shivam Mehta, Yingru Liu, Zhenyu Tang, Kainan Peng, Vimal Manohar, Shun Zhang, Mike Seltzer, Qing He, Mingbo Ma

Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching

Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework…

Sound · Computer Science · 2025-08-12 · Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables…

Sound · Computer Science · 2024-10-15 · Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, Tatsuya Kawahara

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-12-23 · Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of…

Sound · Computer Science · 2025-12-05 · Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen

HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately,…

Sound · Computer Science · 2025-11-18 · Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li

REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody richness, while…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-11 · Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-04-23 · Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen

EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual…

Sound · Computer Science · 2025-05-26 · Advait Joglekar, Divyanshu Singh, Rooshil Rohit Bhatia, S. Umesh

SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice…

Sound · Computer Science · 2023-04-04 · Haozhe Zhang, Zexin Cai, Xiaoyi Qin, Ming Li

AdaptVC: High Quality Voice Conversion with Adaptive Learning

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and…

Sound · Computer Science · 2025-01-15 · Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study,…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-04-01 · Jiachen Lian, Chunlei Zhang, Dong Yu

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-01-09 · Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-11-09 · Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee