Related papers: Creating New Voices using Normalizing Flows

200 papers

Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-16 Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and…

Sound · Computer Science 2025-06-03 Ming Meng, Ziyi Yang, Jian Yang, Zhenjie Su, Yonggui Zhu, Zhaoxin Fan

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these…

Sound · Computer Science 2025-10-13 Huu Tuong Tu, Huan Vu, cuong tien nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality…

Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity,…

Sound · Computer Science 2025-10-06 Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen

The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the…

Computer Vision and Pattern Recognition · Computer Science 2024-05-17 Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-02 Yinghao Aaron Li, Cong Han, Nima Mesgarani

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-11 Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

One-shot voice cloning aims to transform speaker voice and speaking style in speech synthesized from a text-to-speech (TTS) system, where only a shot recording from the target reference speech can be used. Out-of-domain transfer is still a…

Sound · Computer Science 2022-02-25 Rui Li, Dong Pu, Minnie Huang, Bill Huang

Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due…

Sound · Computer Science 2024-03-06 Yejin Jeon, Yunsu Kim, Gary Geunbae Lee

This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from…

Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the…

Sound · Computer Science 2021-09-14 Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently…

Computation and Language · Computer Science 2019-01-04 Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style…

Sound · Computer Science 2023-12-19 Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack…

Sound · Computer Science 2021-02-02 Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language…

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use…

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-17 Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

Voice faking, driven primarily by recent advances in text-to-speech (TTS) synthesis technology, poses significant societal challenges. Currently, the prevailing assumption is that unaltered human speech can be considered genuine, while fake…

Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective, generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we…

Audio and Speech Processing · Electrical Eng. & Systems 2019-09-17 Hieu-Thi Luong, Junichi Yamagishi