Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Unsupervised Learning of Disentangled Speech Content and Style Representation

We present an approach for unsupervised learning of speech representations that disentangle content and style. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance…

Computation and Language · Computer Science 2021-06-22 Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

Voice Reenactment with F0 and timing constraints and adversarial learning of conversions

This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original…

Sound · Computer Science 2022-06-01 Frederik Bous, Laurent Benaroya, Nicolas Obin, Axel Roebel

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-10 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Voice Conversion Augmentation for Speaker Recognition on Defective Datasets

Modern speaker recognition systems rely on abundant and balanced datasets for classification training. However, diverse defective datasets, such as partially-labelled, small-scale, and imbalanced datasets, are common in real-world…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-03 Ruijie Tao, Zhan Shi, Yidi Jiang, Tianchi Liu, Haizhou Li

PPG-based singing voice conversion with adversarial representation learning

Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody. On top of recent voice conversion works, we propose a novel model to steadily convert songs while…

Sound · Computer Science 2020-10-29 Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma

FastVC: Fast Voice Conversion with non-parallel data

This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC). The proposed model can convert speech of arbitrary length from multiple source speakers to multiple target speakers. FastVC is based on a conditional…

Audio and Speech Processing · Electrical Eng. & Systems 2021-05-07 Oriol Barbany Mayor, Milos Cernak

Self-Attention Linguistic-Acoustic Decoder

The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of…

Sound · Computer Science 2018-11-07 Santiago Pascual, Antonio Bonafonte, Joan Serrà

Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning

Self-supervised visual pretraining has shown significant progress recently. Among those methods, SimCLR greatly advanced the state of the art in self-supervised and semi-supervised learning on ImageNet. The input feature representations for…

Computation and Language · Computer Science 2021-07-06 Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li

Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses

Speech enhancement has recently achieved great success with various deep learning methods. However, most conventional speech enhancement systems are trained with supervised methods that impose two significant challenges. First, a majority…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-22 Viet Anh Trinh, Sebastian Braun

Direct Noisy Speech Modeling for Noisy-to-Noisy Voice Conversion

Beyond the conventional voice conversion (VC) where the speaker information is converted without altering the linguistic content, the background sounds are informative and need to be retained in some real-world scenarios, such as VC in…

Sound · Computer Science 2021-11-16 Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

Layer-wise Analysis of a Self-supervised Speech Representation Model

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the…

Computation and Language · Computer Science 2022-12-06 Ankita Pasad, Ju-Chieh Chou, Karen Livescu

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no…

Sound · Computer Science 2026-05-13 Chen Geng, Meng Chen, Ruohua Zhou, Ruolan Liu, Weifeng Zhao

Efficient Personalized Speech Enhancement through Self-Supervised Learning

This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-28 Aswin Sivaraman, Minje Kim

Improving speaker verification robustness with synthetic emotional utterances

A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker. This technology has paved the way for various personalized applications that cater to…

Sound · Computer Science 2024-12-03 Nikhil Kumar Koditala, Chelsea Jui-Ting Ju, Ruirui Li, Minho Jin, Aman Chadha, Andreas Stolcke

Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion

Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages. Its challenge lies in the fact that the source and target data are…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-01 Che-Jui Chang

PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, the current style voice conversion approach relies on pre-defined labels or reference speech to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-12-27 Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional…

Sound · Computer Science 2025-06-05 Seymanur Akti, Tuan Nam Nguyen, Alexander Waibel

Emotional Voice Conversion using Multitask Learning with Text-to-speech

Voice conversion (VC) is the task of transforming a person's voice to a different style while conserving linguistic content. The previous state of the art in VC is based on sequence-to-sequence (seq2seq) models, which could mislead linguistic…

Audio and Speech Processing · Electrical Eng. & Systems 2019-11-28 Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee

End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions

Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-04-04 Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but as the normal speech utterances of a dysarthria patient…

Sound · Computer Science 2021-06-04 Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda