Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving…

Artificial Intelligence · Computer Science 2024-10-22 Suhita Ghosh, Tim Thiele, Frederic Lorbeer, Frank Dreyer, Sebastian Stober

Optimizing voice conversion network with cycle consistency loss of speaker identity

We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only frame-level spectral loss but also speaker identity loss. We introduce a cycle…

Sound · Computer Science 2020-11-18 Hongqiang Du, Xiaohai Tian, Lei Xie, Haizhou Li

QR-VC: Leveraging Quantization Residuals for Linear Disentanglement in Zero-Shot Voice Conversion

Zero-shot voice conversion is a technique that alters the speaker identity of an input speech to match a target speaker using only a single reference utterance, without requiring additional training. Recent approaches extensively utilize…

Sound · Computer Science 2025-09-11 Youngjun Sim, Jinsung Yoon, Wooyeol Jeong, Young-Joo Suh

VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration

We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-03 Stanislav Kirdey

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in…

Sound · Computer Science 2025-05-27 Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro

Self-supervised Speaker Recognition Training Using Human-Machine Dialogues

Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning,…

Machine Learning · Computer Science 2022-07-13 Metehan Cekic, Ruirui Li, Zeya Chen, Yuguang Yang, Andreas Stolcke, Upamanyu Madhow

Towards noise-robust speech inversion through multi-task learning with speech enhancement

Recent studies demonstrate the effectiveness of Self Supervised Learning (SSL) speech representations for Speech Inversion (SI). However, applying SI in real-world scenarios remains challenging due to the pervasive presence of background…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-22 Saba Tabatabaee, Carol Espy-Wilson

SYKI-SVC: Advancing Singing Voice Conversion with Post-Processing Innovations and an Open-Source Professional Testset

Singing voice conversion aims to transform a source singing voice into that of a target singer while preserving the original lyrics, melody, and various vocal techniques. In this paper, we propose a high-fidelity singing voice conversion…

Sound · Computer Science 2025-01-07 Yiquan Zhou, Wenyu Wang, Hongwu Ding, Jiacheng Xu, Jihua Zhu, Xin Gao, Shihao Li

HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately,…

Sound · Computer Science 2025-11-18 Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to…

Sound · Computer Science 2024-12-04 Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

Discrete Unit based Masking for Improving Disentanglement in Voice Conversion

Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker's identity from linguistic information is…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-19 Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman

Data Augmentation for Diverse Voice Conversion in Noisy Environments

Voice conversion (VC) models have demonstrated impressive few-shot conversion quality on the clean, native speech populations they're trained on. However, when source or target speech accents, background noise conditions, or microphone…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-19 Avani Tanna, Michael Saxon, Amr El Abbadi, William Yang Wang

Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units

We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and…

Sound · Computer Science 2023-10-20 Gallil Maimon, Yossi Adi

Disentangling Voice and Content with Self-Supervision for Speaker Recognition

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2023-11-02 Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions

This paper presents a new voice conversion (VC) framework capable of dealing with both additive noise and reverberation, along with its performance evaluation. Some VC studies have focused on real-world circumstances where…

Sound · Computer Science 2022-07-01 Yeonjong Choi, Chao Xie, Tomoki Toda

Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification

Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-06 Aswin Sivaraman, Sunwoo Kim, Minje Kim

Semi-supervised voice conversion with amortized variational inference

In this work we introduce a semi-supervised approach to the voice conversion problem, in which speech from a source speaker is converted into speech of a target speaker. The proposed method makes use of both parallel and non-parallel…

Machine Learning · Statistics 2019-10-02 Cory Stephenson, Gokce Keskin, Anil Thomas, Oguz H. Elibol

End-to-End Voice Conversion with Information Perturbation

The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-16 Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

Generating Novel and Realistic Speakers for Voice Conversion

Voice conversion models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is…

Sound · Computer Science 2025-11-11 Meiying Melissa Chen, Zhenyu Wang, Zhiyao Duan

Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-09 Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng, Yu Tsao, Hsin-Min Wang