Related papers: Multi-GradSpeech: Towards Diffusion-based Multi-Sp…

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech…

Sound · Computer Science 2024-04-02 Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to advances in neural generative modeling. However, existing methods on any-speaker adaptive TTS have achieved unsatisfactory…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-15 Minki Kang, Dongchan Min, Sung Ju Hwang

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing…

Machine Learning · Computer Science 2021-08-06 Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate human speech's diversity, including unique speaker identities and linguistic nuances. Despite these advancements, achieving an optimal balance between…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-28 Jinhyeok Yang, Junhyeok Lee, Hyeong-Seok Choi, Seunghun Ji, Hyeongju Kim, Juheon Lee

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance

We propose Guided-TTS, a high-quality text-to-speech (TTS) model that, using classifier guidance, does not require any transcript of the target speaker. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained…

Sound · Computer Science 2022-06-13 Heeseung Kim, Sungwon Kim, Sungroh Yoon

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-21 Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-27 Xinlu He, Swayambhu Nath Ray, Harish Mallidi, Jia-Hong Huang, Ashwin Bellur, Chander Chandak, M. Maruf, Venkatesh Ravichandran

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS.…

Sound · Computer Science 2023-12-19 Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

Scaling text-to-speech (TTS) with autoregressive language model (LM) to large-scale datasets by quantizing waveform into discrete speech tokens is making great progress to capture the diversity and expressiveness in human speech, but the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-10 Chong Zhang, Yanqing Liu, Yang Zheng, Sheng Zhao

DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

In the text-to-speech (TTS) task, the latent diffusion model has excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have always been challenging. This paper proposes Discrete Diffusion…

Sound · Computer Science 2023-09-14 Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech

Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily life, where models are deployed in the cloud to provide services for customers. Among these models are diffusion probabilistic models (DPMs),…

Sound · Computer Science 2023-09-01 Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu

MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech Enhancement

With the development of deep learning, speech enhancement has been greatly optimized in terms of speech quality. Previous methods typically focus on the discriminative supervised learning or generative modeling, which tends to introduce…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-31 Nan Xu, Zhaolong Huang, Xiaonan Zhi

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, the adoption of Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), has gained traction due to their ability…

Sound · Computer Science 2023-05-23 Xin Jing, Yi Chang, Zijiang Yang, Jiangjian Xie, Andreas Triantafyllopoulos, Bjoern W. Schuller

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional…

Sound · Computer Science 2024-12-05 Jiaxuan Liu, Zhaoci Liu, Yajun Hu, Yingying Gao, Shilei Zhang, Zhenhua Ling

DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors

Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-18 Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation, especially translation tasks, remains a non-trivial problem. Specifically,…

Computation and Language · Computer Science 2023-10-27 Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu

Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning

Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or…

Sound · Computer Science 2025-01-20 Shengkui Zhao, Zexu Pan, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) models (e.g., Transformer TTS (Li et al., 2019), FastSpeech (Ren et al., 2019)) have shown advantages in training and inference efficiency over RNN-based models (e.g.,…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-04 Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

Diffusion-based Generative AI has gained significant attention for its superior performance over other generative techniques like Generative Adversarial Networks and Variational Autoencoders. While it has achieved notable advancements in fields…

Sound · Computer Science 2024-12-12 Haowei Lou, Helen Paik, Pari Delir Haghighi, Wen Hu, Lina Yao