Related papers: Speaker Generation

200 papers

Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variations with an embedding vector. Modeling it directly allows new voices outside of training data to be synthesized. GMM based approaches such as…

Sound · Computer Science 2023-09-26 Yao Shi, Ming Li

In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's naturally sounding…

Sound · Computer Science 2022-10-19 Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain…

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently…

Computation and Language · Computer Science 2019-01-04 Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that…

Machine Learning · Computer Science 2018-02-21 Eliya Nachmani, Adam Polyak, Yaniv Taigman, Lior Wolf

Numerous models have shown great success in the fields of speech recognition as well as speech synthesis, but models for speech to speech processing have not been heavily explored. We propose Speech to Speech Synthesis Network (STSSN), a…

Sound · Computer Science 2026-02-20 Bjorn Johnson, Jared Levy

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-17 Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. To this end, a deep neural network is usually trained using a…

Sound · Computer Science 2021-02-11 Giuseppe Ruggiero, Enrico Zovato, Luigi Di Caro, Vincent Pollet

Several recently proposed text-to-speech (TTS) models achieved to generate the speech samples with the human-level quality in the single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-23 Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Sung Hwan Mun, Nam Soo Kim

Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process.…

Sound · Computer Science 2023-10-10 Yongmao Zhang, Guanghou Liu, Yi Lei, Yunlin Chen, Hao Yin, Lei Xie, Zhifei Li

The task of synthetic speech generation is to generate language content from a given text, then simulating fake human voice. The key factors that determine the effect of synthetic speech generation mainly include speed of generation,…

Sound · Computer Science 2023-07-04 Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen

In recent years, several text-to-speech systems have been proposed to synthesize natural speech in zero-shot, few-shot, and low-resource scenarios. However, these methods typically require training with data from many different speakers.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuel A. P. Habets

We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of…

Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers is transferred to another set of speakers that do not feature these aspects originally - remains a challenging task. One of…

With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures,…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-27 Artem Gorodetskii, Ivan Ozhiganov

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By…

Computation and Language · Computer Science 2024-06-11 Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect. In this paper, we propose…

Computation and Language · Computer Science 2018-08-31 Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

Understanding the lip movement and inferring the speech from it is notoriously difficult for the common person. The task of accurate lip-reading gets help from various cues of the speaker and its contextual or environmental setting. Every…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active…

Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-20 Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe