Related papers: SelfVC: Voice Conversion With Iterative Refinement…

A Comparative Analysis Of Latent Regressor Losses For Singing Voice Conversion

Previous research has shown that established techniques for spoken voice conversion (VC) do not perform as well when applied to singing voice conversion (SVC). We propose an alternative loss component in a loss function that is otherwise…

Sound · Computer Science 2023-02-28 Brendan O'Connor, Simon Dixon

Singer Identity Representation Learning using Self-Supervised Techniques

Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer…

Sound · Computer Science 2024-01-11 Bernardo Torres, Stefan Lattner, Gaël Richard

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and…

Sound · Computer Science 2021-06-17 Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features

In voice conversion (VC), it is crucial to preserve complete semantic information while accurately modeling the target speaker's timbre and prosody. This paper proposes FabasedVC to achieve VC with enhanced similarity in timbre, prosody,…

Sound · Computer Science 2025-11-14 Wenyu Wang, Zhetao Hu, Yiquan Zhou, Jiacheng Xu, Zhiyu Wu, Chen Li, Shihao Li

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of…

Sound · Computer Science 2024-12-17 Yifeng Yu, Jiatong Shi, Yuning Wu, Yuxun Tang, Shinji Watanabe

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-26 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

Recently, cycle-consistent adversarial network (Cycle-GAN) has been successfully applied to voice conversion to a different speaker without parallel data, although in those approaches an individual model is needed for each target speaker.…

Audio and Speech Processing · Electrical Eng. & Systems 2018-06-26 Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, Lin-shan Lee

AVQVC: One-shot Voice Conversion by Vector Quantization with applying contrastive learning

Voice Conversion (VC) refers to changing the timbre of a speech while retaining the discourse content. Recently, many works have focused on disentangle-based learning techniques to separate the timbre and the linguistic content information…

Sound · Computer Science 2022-02-22 Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-07 Xu Li, Shansong Liu, Ying Shan

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source…

Sound · Computer Science 2026-04-21 Tao Feng, Yuxiang Wang, Yuancheng Wang, Xueyao Zhang, Dekun Chen, Chaoren Wang, Xun Guan, Zhizheng Wu

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Building cross-lingual voice conversion (VC) systems for multiple speakers and multiple languages has been a challenging task for a long time. This paper describes a parallel non-autoregressive network to achieve bilingual and code-switched…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-23 Yaogen Yang, Haozhe Zhang, Xiaoyi Qin, Shanshan Liang, Huahua Cui, Mingyang Xu, Ming Li

Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-06 Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

Towards General-Purpose Text-Instruction-Guided Voice Conversion

This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-17 Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally take into account disentangling speech components through human-crafted…

Sound · Computer Science 2024-05-01 Ziqi Liang, Jianzong Wang, Xulong Zhang, Yong Zhang, Ning Cheng, Jing Xiao

Who is Authentic Speaker

Voice conversion (VC) using deep learning technologies can now generate high quality one-to-many voices and thus has been used in some practical application fields, such as entertainment and healthcare. However, voice conversion can pose…

Sound · Computer Science 2024-05-02 Qiang Huang

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and…

Sound · Computer Science 2024-11-26 Pengcheng Li, Jianzong Wang, Xulong Zhang, Yong Zhang, Jing Xiao, Ning Cheng

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC…

Sound · Computer Science 2024-06-12 Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

In this paper, we focus on improving the performance of the text-dependent speaker verification system in the scenario of limited training data. A deep-learning-based text-dependent speaker verification system generally needs a large…

Sound · Computer Science 2020-11-24 Xiaoyi Qin, Yaogen Yang, Lin Yang, Xuyang Wang, Junjie Wang, Ming Li

Disentangled Speech Embeddings using Cross-modal Self-supervision

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-05 Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-08 Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen