Related papers: Speaker Generation

VoiceLens: Controllable Speaker Generation and Editing with Flow

Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variations with an embedding vector. Modeling it directly allows new voices outside of training data to be synthesized. GMM based approaches such as…

Sound · Computer Science 2023-09-26 Yao Shi , Ming Li

Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's naturally sounding…

Sound · Computer Science 2022-10-19 Aya Watanabe , Shinnosuke Takamichi , Yuki Saito , Detai Xin , Hiroshi Saruwatari

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain…

Computation and Language · Computer Science 2017-04-10 Yuxuan Wang , RJ Skerry-Ryan , Daisy Stanton , Yonghui Wu , Ron J. Weiss , Navdeep Jaitly , Zongheng Yang , Ying Xiao , Zhifeng Chen , Samy Bengio , Quoc Le , Yannis Agiomyrgiannakis , Rob Clark , Rif A. Saurous

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently…

Computation and Language · Computer Science 2019-01-04 Ye Jia , Yu Zhang , Ron J. Weiss , Quan Wang , Jonathan Shen , Fei Ren , Zhifeng Chen , Patrick Nguyen , Ruoming Pang , Ignacio Lopez Moreno , Yonghui Wu

Fitting New Speakers Based on a Short Untranscribed Sample

Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that…

Machine Learning · Computer Science 2018-02-21 Eliya Nachmani , Adam Polyak , Yaniv Taigman , Lior Wolf

Speech to Speech Synthesis for Voice Impersonation

Numerous models have shown great success in the fields of speech recognition as well as speech synthesis, but models for speech to speech processing have not been heavily explored. We propose Speech to Speech Synthesis Network (STSSN), a…

Sound · Computer Science 2026-02-20 Bjorn Johnson , Jared Levy

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-17 Leying Zhang , Yao Qian , Long Zhou , Shujie Liu , Dongmei Wang , Xiaofei Wang , Midia Yousefi , Yanmin Qian , Jinyu Li , Lei He , Sheng Zhao , Michael Zeng

Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning

Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. To this end, a deep neural network is usually trained using a…

Sound · Computer Science 2021-02-11 Giuseppe Ruggiero , Enrico Zovato , Luigi Di Caro , Vincent Pollet

Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Several recently proposed text-to-speech (TTS) models have succeeded in generating speech samples with human-level quality in single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-23 Byoung Jin Choi , Myeonghun Jeong , Minchan Kim , Sung Hwan Mun , Nam Soo Kim

PromptSpeaker: Speaker Generation Based on Text Descriptions

Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process.…

Sound · Computer Science 2023-10-10 Yongmao Zhang , Guanghou Liu , Yi Lei , Yunlin Chen , Hao Yin , Lei Xie , Zhifei Li

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

The task of synthetic speech generation is to generate language content from a given text, then simulating fake human voice. The key factors that determine the effect of synthetic speech generation mainly include speed of generation,…

Sound · Computer Science 2023-07-04 Sheng Zhao , Qilong Yuan , Yibo Duan , Zhuoyue Chen

Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron

In recent years, several text-to-speech systems have been proposed to synthesize natural speech in zero-shot, few-shot, and low-resource scenarios. However, these methods typically require training with data from many different speakers.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Kishor Kayyar Lakshminarayana , Frank Zalkow , Christian Dittmar , Nicola Pia , Emanuel A. P. Habets

Sample Efficient Adaptive Text-to-Speech

We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of…

Machine Learning · Computer Science 2019-01-18 Yutian Chen , Yannis Assael , Brendan Shillingford , David Budden , Scott Reed , Heiga Zen , Quan Wang , Luis C. Cobo , Andrew Trask , Ben Laurie , Caglar Gulcehre , Aäron van den Oord , Oriol Vinyals , Nando de Freitas

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not feature these aspects originally - remain challenging. One of…

Sound · Computer Science 2022-08-30 Lev Finkelstein , Heiga Zen , Norman Casagrande , Chun-an Chan , Ye Jia , Tom Kenter , Alexey Petelin , Jonathan Shen , Vincent Wan , Yu Zhang , Yonghui Wu , Rob Clark

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures,…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-27 Artem Gorodetskii , Ivan Ozhiganov

Meta Learning Text-to-Speech Synthesis in over 7000 Languages

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By…

Computation and Language · Computer Science 2024-06-11 Florian Lux , Sarina Meyer , Lyonel Behringer , Frank Zalkow , Phat Do , Matt Coler , Emanuël A. P. Habets , Ngoc Thang Vu

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect. In this paper, we propose…

Computation and Language · Computer Science 2018-08-31 Yu-An Chung , Yuxuan Wang , Wei-Ning Hsu , Yu Zhang , RJ Skerry-Ryan

Learning Speaker-specific Lip-to-Speech Generation

Understanding the lip movement and inferring the speech from it is notoriously difficult for the common person. The task of accurate lip-reading gets help from various cues of the speaker and its contextual or environmental setting. Every…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Munender Varshney , Ravindra Yadav , Vinay P. Namboodiri , Rajesh M Hegde

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active…

Sound · Computer Science 2023-04-04 Chenshuang Zhang , Chaoning Zhang , Sheng Zheng , Mengchun Zhang , Maryam Qamar , Sung-Ho Bae , In So Kweon

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-20 Samuele Cornell , Jordan Darefsky , Zhiyao Duan , Shinji Watanabe