Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-30 · Wangyou Zhang, Yanmin Qian

R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion

In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment…

Sound · Computer Science · 2025-10-24 · Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms.…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-06-13 · Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems

Supervised speech enhancement relies on parallel databases of degraded speech signals and their clean reference signals during training. This setting prohibits the use of real-world degraded speech data that may better represent the…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-09-22 · Yangyang Xia, Buye Xu, Anurag Kumar

Creating New Voices using Normalizing Flows

Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of…

Sound · Computer Science · 2023-12-25 · Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Singing Voice Conversion (SVC) transfers a source singer's timbre to a target while keeping melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing…

Sound · Computer Science · 2025-08-11 · Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu

Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks

Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study…

Sound · Computer Science · 2026-01-09 · Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel 2-stage training strategy for sequence-to-sequence emotional…

Computation and Language · Computer Science · 2021-06-10 · Kun Zhou, Berrak Sisman, Haizhou Li

REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody richness, while…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-11 · Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie

An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning

We present a method for transferring pre-trained self-supervised (SSL) speech representations to multiple languages. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and fine-tuning on…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-02-08 · Samuel Kessler, Bethan Thomas, Salah Karout

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While…

Audio and Speech Processing · Electrical Eng. & Systems · 2019-12-17 · Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

Factorizing speech as disentangled speech representations is vital to achieve highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC only factorize speech as speaker and…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-12-06 · Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng

EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual…

Sound · Computer Science · 2025-05-26 · Advait Joglekar, Divyanshu Singh, Rooshil Rohit Bhatia, S. Umesh

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. A recognizer is used to transform acoustic features into linguistic representations while a synthesizer recovers output…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-08-07 · Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Unsupervised Singing Voice Conversion

We present a deep learning method for singing voice conversion. The proposed network is not conditioned on the text or on the notes, and it directly converts the audio of one singer to the voice of another. Training is performed without any…

Machine Learning · Computer Science · 2019-09-26 · Eliya Nachmani, Lior Wolf

Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty

Singing voice synthesis (SVS) has seen remarkable advancements in recent years. However, compared to speech and general audio data, publicly available singing datasets remain limited. In practice, this data scarcity often leads to…

Sound · Computer Science · 2025-12-17 · Yiwen Zhao, Jiatong Shi, Yuxun Tang, William Chen, Shinji Watanabe

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker's identity. In this work, we specifically focus on in-the-wild emotion…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-06 · Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data

While many recent any-to-any voice conversion models succeed in transferring some target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-12-18 · Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, Erik Visser

Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder

Voice conversion is a challenging task which transforms the voice characteristics of a source speaker to a target speaker without changing linguistic content. Recently, there have been many works on many-to-many Voice Conversion (VC) based…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-09-23 · Manh Luong, Viet Anh Tran

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Singing voice conversion (SVC) is one promising technique which can enrich the way of human-computer interaction by endowing a computer the ability to produce high-fidelity and expressive singing voice. In this paper, we propose DiffSVC, an…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-05-31 · Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng