Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Iteratively Improving Speech Recognition and Voice Conversion

Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, for low-resource domains, training a high-quality…

Sound · Computer Science 2023-05-25 Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki

Transferring Source Style in Non-Parallel Voice Conversion

Voice conversion (VC) techniques aim to modify speaker identity of an utterance while preserving the underlying linguistic information. Most VC approaches ignore modeling of the speaking style (e.g. emotion and emphasis), which may contain…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-20 Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study,…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Jiachen Lian, Chunlei Zhang, Dong Yu

SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice…

Sound · Computer Science 2023-04-04 Haozhe Zhang, Zexin Cai, Xiaoyi Qin, Ming Li

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling the speaker identity…

Sound · Computer Science 2022-08-09 Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao

ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks…

Sound · Computer Science 2022-06-27 Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Self-Supervised Representations for Singing Voice Conversion

A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0 have helped…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-23 Tejas Jayashankar, Jilong Wu, Leda Sari, David Kant, Vimal Manohar, Qing He

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion

Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised models into semantic tokens, facilitating disentanglement of speech content information.…

Sound · Computer Science 2024-09-11 Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the…

Machine Learning · Computer Science 2019-08-23 Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Singing voice conversion converts the source singing voice into the target singing voice while leaving the content unchanged. Currently, flow-based models can complete the task of voice conversion, but they struggle to effectively extract latent…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-10 Hui Li, Hongyu Wang, Zhijin Chen, Bohan Sun, Bo Li

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-22 Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-02 Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv R. Shah

Investigation into Target Speaking Rate Adaptation for Voice Conversion

Disentangling speaker and content attributes of a speech signal into separate latent representations followed by decoding the content with an exchanged speaker representation is a popular approach for voice conversion, which can be trained…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-07 Michael Kuhlmann, Fritz Seebauer, Janek Ebbers, Petra Wagner, Reinhold Haeb-Umbach

Voice Conversion with Conditional SampleRNN

Here we present a novel approach to conditioning the SampleRNN generative model for voice conversion (VC). Conventional methods for VC modify the perceived speaker identity by converting between source and target acoustic features. Our…

Sound · Computer Science 2018-10-30 Cong Zhou, Michael Horgan, Vivek Kumar, Cristina Vasco, Dan Darcy

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

Any-to-any singing voice conversion (SVC) is confronted with the challenge of the "timbre leakage" issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a…

Sound · Computer Science 2024-01-09 Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework

We propose a speech enhancement system that combines speaker-agnostic speech restoration with voice conversion (VC) to obtain a studio-level quality speech signal. While voice conversion models are typically used to change speaker…

Sound · Computer Science 2025-05-22 Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-18 Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation

Voice Conversion (VC) converts the voice of a source speech to that of a target while maintaining the source's content. Speech can be mainly decomposed into four components: content, timbre, rhythm and pitch. Unfortunately, most related…

Sound · Computer Science 2023-06-22 Zhonghua Liu, Shijun Wang, Ning Chen

Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the…

Sound · Computer Science 2024-08-27 Zhaoxi Mu, Xinyu Yang, Sining Sun, Qing Yang

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, the issue of privacy protection is an increasing concern that must be addressed. The current study…

Computation and Language · Computer Science 2022-04-05 Abner Hernandez, Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang