Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion

Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules,…

Sound · Computer Science 2025-10-13 Zhao Guo, Ziqian Ning, Guobin Ma, Lei Xie

Self-Supervised Learning for Speech Enhancement through Synthesis

Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-07 Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, Li-Chia Yang

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable…

Sound · Computer Science 2021-07-28 Laurent Benaroya, Nicolas Obin, Axel Roebel

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior…

Sound · Computer Science 2017-08-08 Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker…

Sound · Computer Science 2021-07-28 Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing the waveform from this information. However, current approaches normally either extract dirty content…

Sound · Computer Science 2022-10-28 Jingyi Li, Weiping Tu, Li Xiao

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-09 Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in…

Sound · Computer Science 2024-05-02 Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-02 Yinghao Aaron Li, Cong Han, Nima Mesgarani

VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion

Noise suppression (NS) algorithms are effective in improving speech quality in many cases. However, aggressive noise suppression can damage the target speech, reducing both speech intelligibility and quality despite removing the noise. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-11 Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon

LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau

Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech

Zero-shot voice conversion aims to transfer the voice of a source speaker to that of a speaker unseen during training, while preserving the content information. Although various methods have been proposed to reconstruct speaker information…

Sound · Computer Science 2024-08-22 Anastasia Avdeeva, Aleksei Gusev

Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features

Melody preservation is crucial in singing voice conversion (SVC). However, in many scenarios, audio is often accompanied with background music (BGM), which can cause audio distortion and interfere with the extraction of melody and other key…

Sound · Computer Science 2025-02-10 Wei Chen, Binzhu Sha, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu

Invertible Voice Conversion

In this paper, we propose an invertible deep learning framework called INVVC for voice conversion. It is designed against the possible threats that inherently come along with voice conversion systems. Specifically, we develop an invertible…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-27 Zexin Cai, Ming Li

Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-18 Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers seen or unseen during training. Various any-to-any VC approaches have been proposed like AUTOVC, AdaINVC, and FragmentVC. AUTOVC, and AdaINVC…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-15 Jheng-hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee

Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders

We propose a flexible framework that deals with both singer conversion and singers' vocal technique conversion. The proposed model is trained on non-parallel corpora, accommodates many-to-many conversion, and leverages recent advances of…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-26 Yin-Jyun Luo, Chin-Chen Hsu, Kat Agres, Dorien Herremans

DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion

Voice conversion is a task to convert a non-linguistic feature of a given utterance. Since naturalness of speech strongly depends on its pitch pattern, in some applications, it would be desirable to keep the original rise/fall pitch pattern…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-21 Chihiro Watanabe, Hirokazu Kameoka

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling…

Sound · Computer Science 2024-06-17 Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie