Related papers: Diffusion-Based Voice Conversion with Fast Maximum…

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-18 Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the…

Machine Learning · Computer Science 2019-08-23 Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-04 Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. There are many researchers using deep generative models for voice conversion tasks. Generative…

Sound · Computer Science 2023-08-29 Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

A Comprehensive Survey on Diffusion Models and Their Applications

Diffusion Models are probabilistic models that create realistic samples by simulating the diffusion process, gradually adding and removing noise from data. These models have gained popularity in domains such as image processing, speech…

Computer Vision and Pattern Recognition · Computer Science 2024-08-21 Md Manjurul Ahsan, Shivakumar Raman, Yingtao Liu, Zahed Siddique

AdaptVC: High Quality Voice Conversion with Adaptive Learning

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and…

Sound · Computer Science 2025-01-15 Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer…

Machine Learning · Computer Science 2024-06-17 Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire

Conditional Diffusion Probabilistic Model for Speech Enhancement

Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs. While generative models have shown strong potential in speech synthesis, they are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-11 Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content.…

Sound · Computer Science 2024-08-30 Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut

One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation

One-shot style transfer is a challenging task, since training on a single utterance makes the model extremely prone to overfitting the training data, which causes low speaker similarity and a lack of expressiveness. In this paper, we build on the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-22 Zhichao Wang, Qicong Xie, Tao Li, Hongqiang Du, Lei Xie, Pengcheng Zhu, Mengxiao Bi

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active…

Sound · Computer Science 2023-04-04 Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, In So Kweon

TransFusion: Transcribing Speech with Multinomial Diffusion

Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-17 Matthew Baas, Kevin Eloff, Herman Kamper

Diffuse or Confuse: A Diffusion Deepfake Speech Dataset

Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion…

Cryptography and Security · Computer Science 2025-01-15 Anton Firc, Kamil Malinka, Petr Hanáček

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-02 Yinghao Aaron Li, Cong Han, Nima Mesgarani

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Singing voice conversion (SVC) is a promising technique that can enrich human-computer interaction by giving a computer the ability to produce high-fidelity and expressive singing voice. In this paper, we propose DiffSVC, an…

Audio and Speech Processing · Electrical Eng. & Systems 2021-05-31 Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Singing voice conversion converts the source singing voice into the target singing voice while leaving the content unchanged. Currently, flow-based models can complete the task of voice conversion, but they struggle to effectively extract latent…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-10 Hui Li, Hongyu Wang, Zhijin Chen, Bohan Sun, Bo Li

Voice conversion with limited data and limitless data augmentations

Changing the perceived speaker of an input speech signal to a target speaker while maintaining the content of the input is a challenging but interesting task known as voice conversion (VC). Over the last few years, this…

Sound · Computer Science 2022-12-29 Olga Slizovskaia, Jordi Janer, Pritish Chandna, Oscar Mayor

Duplex Diffusion Models Improve Speech-to-Speech Translation

Speech-to-speech translation is a typical sequence-to-sequence learning task that naturally has two directions. How to effectively leverage bidirectional supervision signals to produce high-fidelity audio for both directions? Existing…

Computation and Language · Computer Science 2023-05-23 Xianchao Wu

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker,…

Sound · Computer Science 2023-09-08 Kyungguen Byun, Sunkuk Moon, Erik Visser

DiffVoice: Text-to-Speech with Latent Diffusion

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-25 Zhijun Liu, Yiwei Guo, Kai Yu