Related papers: SelfVC: Voice Conversion With Iterative Refinement…

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using language model-based or…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-11 Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie

Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer

In this paper, we propose a model to perform style transfer of speech to singing voice. Contrary to the previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a…

Sound · Computer Science 2022-08-29 Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi

Noisy-to-Noisy Voice Conversion Framework with Denoising Model

In a conventional voice conversion (VC) framework, a VC model is often trained with a clean dataset consisting of speech data carefully recorded and selected by minimizing background interference. However, collecting such a high-quality…

Sound · Computer Science 2021-09-23 Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

Intra-class variation reduction of speaker representation in disentanglement framework

In this paper, we propose an effective training strategy to extract robust speaker representations from a speech signal. One of the key challenges in speaker recognition tasks is to learn latent representations or embeddings containing…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-05 Yoohwan Kwon, Soo-Whan Chung, Hong-Goo Kang

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-05 Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion

Voice Conversion (VC) emerged as a significant domain of research in the field of speech synthesis in recent years due to its emerging application in voice-assisting technology, automated movie dubbing, and speech-to-singing conversion to…

Sound · Computer Science 2021-04-27 Sandipan Dhar, Nanda Dulal Jana, Swagatam Das

From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

High-fidelity speech can be synthesized by end-to-end text-to-speech models in recent years. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge.…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-05 Zexin Cai, Chuxiong Zhang, Ming Li

Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching

The goal of cross-speaker style transfer in TTS is to transfer a speech style from a source speaker with expressive data to a target speaker with only neutral data. In this context, we propose using a pre-trained singing voice conversion…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-10 Leonardo B. de M. M. Marques, Lucas H. Ueda, Mário U. Neto, Flávio O. Simões, Fernando Runstein, Bianca Dal Bó, Paula D. P. Costa

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed. Due to the difficulty of data collection, VC without parallel data is highly…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-26 Li-Wei Chen, Hung-Yi Lee, Yu Tsao

Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using $\beta$-VAE

We propose an unsupervised learning method to disentangle speech into content representation and speaker identity representation. We apply this method to the challenging one-shot cross-lingual voice conversion task to demonstrate the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-26 Hui Lu, Disong Wang, Xixin Wu, Zhiyong Wu, Xunying Liu, Helen Meng

Are disentangled representations all you need to build speaker anonymization systems?

Speech signals contain a lot of sensitive information, such as the speaker's identity, which raises privacy concerns when speech data get collected. Speaker anonymization aims to transform a speech signal to remove the source speaker's…

Sound · Computer Science 2023-01-16 Pierre Champion, Denis Jouvet, Anthony Larcher

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely…

Sound · Computer Science 2023-09-19 Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed…

Sound · Computer Science 2020-10-08 Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity…

Sound · Computer Science 2023-09-27 Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter

Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features…

Sound · Computer Science 2021-10-29 Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Hwan Lee, Hoon Heo, Kyogu Lee

Speech Synthesis along Perceptual Voice Quality Dimensions

While expressive speech synthesis or voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we here address the control of perceptual voice qualities…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-16 Frederik Rautenberg, Michael Kuhlmann, Fritz Seebauer, Jana Wiechmann, Petra Wagner, Reinhold Haeb-Umbach

Text-free non-parallel many-to-many voice conversion using normalising flows

Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-16 Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization

Self-supervised learning (SSL) has shown significant progress in speech processing tasks. However, despite the intrinsic randomness in the Transformer structure, such as dropout variants and layer-drop, improving the model-level consistency…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-16 Ji Won Yoon, Seok Min Kim, Nam Soo Kim

Voice Conversion Can Improve ASR in Very Low-Resource Settings

Voice conversion (VC) could be used to improve speech recognition systems in low-resource languages by using it to augment limited training data. However, VC has not been widely used for this purpose because of practical issues such as…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-22 Matthew Baas, Herman Kamper

Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion

Though significant progress has been made for the voice conversion (VC) of typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting for atypical prosody…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-26 Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng