Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Investigating self-supervised features for expressive, multilingual voice conversion

Voice conversion (VC) systems are widely used for several applications, from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-14 Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba

Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representation…

Sound · Computer Science 2022-02-14 Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida

AdaptVC: High Quality Voice Conversion with Adaptive Learning

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and…

Sound · Computer Science 2025-01-15 Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

Learning in your voice: Non-parallel voice conversion based on speaker consistency loss

In this paper, we propose a novel voice conversion strategy to resolve the mismatch between the training and conversion scenarios when a parallel speech corpus is unavailable for training. Based on auto-encoder and disentanglement frameworks,…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-05 Yoohwan Kwon, Soo-Whan Chung, Hee-Soo Heo, Hong-Goo Kang

O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these…

Sound · Computer Science 2025-10-13 Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Recognition-synthesis-based methods have become quite popular for voice conversion (VC). By introducing linguistic features with good disentangling characteristics extracted from an automatic speech recognition (ASR) model, the VC…

Sound · Computer Science 2023-05-17 Xintao Zhao, Shuai Wang, Yang Chao, Zhiyong Wu, Helen Meng

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio…

Sound · Computer Science 2024-01-19 Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content,…

Sound · Computer Science 2023-02-17 Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic…

Multimedia · Computer Science 2025-06-04 Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications such as voice customization and animation production. Recent work in this area made progress…

Sound · Computer Science 2022-06-01 Shijun Wang, Dimche Kostadinov, Damian Borth

GenVC: Self-Supervised Zero-Shot Voice Conversion

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-21 Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

The any-to-any voice conversion problem aims to convert voices for source and target speakers that are absent from the training data. Previous works widely use disentangle-based models. The disentangle-based model assumes the speech…

Sound · Computer Science 2022-02-23 Qiqi Wang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive…

Sound · Computer Science 2025-06-02 Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li

Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

Identity, accent, style, and emotions are essential components of human speech. Voice conversion (VC) techniques process the speech signals of two input speakers and other modalities of auxiliary information such as prompts and emotion…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-09 Xining Song, Zhihua Wei, Rui Wang, Haixiao Hu, Yanxiang Chen, Meng Han

Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the…

Sound · Computer Science 2024-09-05 Yan Rong, Li Liu

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-07 Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba

Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning

Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that…

Sound · Computer Science 2025-01-28 Qian Yang, Calbert Graham

Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks

Emotional voice conversion (VC) aims to convert a neutral voice to an emotional (e.g. happy) one while retaining the linguistic information and speaker identity. We note that the decoupling of emotional features from other speech…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-05 Zhaojie Luo, Shoufeng Lin, Rui Liu, Jun Baba, Yuichiro Yoshikawa, Ishiguro Hiroshi

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

Voice conversion (VC) is a task that transforms voice from target audio to source without losing linguistic content; it is especially challenging when source and target speakers are unseen during training (zero-shot VC). Previous…

Sound · Computer Science 2021-04-14 Shijun Wang, Damian Borth

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

Recently, voice conversion (VC) has been widely studied. Many VC systems use disentangle-based learning techniques to separate the speaker and the linguistic content information from a speech signal. Subsequently, they convert the voice by…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-03 Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, Hung-yi Lee