Related papers: Creating New Voices using Normalizing Flows

Text-free non-parallel many-to-many voice conversion using normalising flows

Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-16 Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation

Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and…

Sound · Computer Science 2025-06-03 Ming Meng, Ziyi Yang, Jian Yang, Zhenjie Su, Yonggui Zhu, Zhaoxin Fan

O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these…

Sound · Computer Science 2025-10-13 Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality…

Sound · Computer Science 2024-08-28 Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity,…

Sound · Computer Science 2025-10-06 Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the…

Computer Vision and Pattern Recognition · Computer Science 2024-05-17 Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-02 Yinghao Aaron Li, Cong Han, Nima Mesgarani

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-11 Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

One-shot voice cloning aims to transform speaker voice and speaking style in speech synthesized from a text-to-speech (TTS) system, where only a shot recording from the target reference speech can be used. Out-of-domain transfer is still a…

Sound · Computer Science 2022-02-25 Rui Li, Dong Pu, Minnie Huang, Bill Huang

Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations at adapting to new speakers of out-of-domain settings, primarily due…

Sound · Computer Science 2024-03-06 Yejin Jeon, Yunsu Kim, Gary Geunbae Lee

vTTS: visual-text to speech

This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from…

Sound · Computer Science 2022-03-29 Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the…

Sound · Computer Science 2021-09-14 Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently…

Computation and Language · Computer Science 2019-01-04 Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style…

Sound · Computer Science 2023-12-19 Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

Expressive Neural Voice Cloning

Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack…

Sound · Computer Science 2021-02-02 Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language…

Sound · Computer Science 2024-04-10 Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

VoiceMe: Personalized voice generation in TTS

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use…

Sound · Computer Science 2022-07-12 Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby

Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-17 Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

A New Approach to Voice Authenticity

Voice faking, driven primarily by recent advances in text-to-speech (TTS) synthesis technology, poses significant societal challenges. Currently, the prevailing assumption is that unaltered human speech can be considered genuine, while fake…

Sound · Computer Science 2024-02-12 Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger

Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech

Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective, generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we…

Audio and Speech Processing · Electrical Eng. & Systems 2019-09-17 Hieu-Thi Luong, Junichi Yamagishi