Related papers: Multi-GradSpeech: Towards Diffusion-based Multi-Sp…

200 papers

Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech…

Sound · Computer Science 2024-04-02 Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to the advancement in neural generative modeling. However, existing methods on any-speaker adaptive TTS have achieved unsatisfactory…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-15 Minki Kang, Dongchan Min, Sung Ju Hwang

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing…

Machine Learning · Computer Science 2021-08-06 Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov

Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate human speech's diversity, including unique speaker identities and linguistic nuances. Despite these advancements, achieving an optimal balance between…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-28 Jinhyeok Yang, Junhyeok Lee, Hyeong-Seok Choi, Seunghun Ji, Hyeongju Kim, Juheon Lee

We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcript of target speaker using classifier guidance. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained…

Sound · Computer Science 2022-06-13 Heeseung Kim, Sungwon Kim, Sungroh Yoon

Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-21 Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin

Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-27 Xinlu He, Swayambhu Nath Ray, Harish Mallidi, Jia-Hong Huang, Ashwin Bellur, Chander Chandak, M. Maruf, Venkatesh Ravichandran

Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS.…

Sound · Computer Science 2023-12-19 Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

Scaling text-to-speech (TTS) with autoregressive language model (LM) to large-scale datasets by quantizing waveform into discrete speech tokens is making great progress to capture the diversity and expressiveness in human speech, but the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-10 Chong Zhang, Yanqing Liu, Yang Zheng, Sheng Zhao

In the Text-to-speech (TTS) task, the latent diffusion model has excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have always been challenging. This paper proposes Discrete Diffusion…

Sound · Computer Science 2023-09-14 Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang

Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily life, where models are deployed in the cloud to provide services for customers. Among these models are diffusion probabilistic models (DPMs),…

Sound · Computer Science 2023-09-01 Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu

With the development of deep learning, speech enhancement has been greatly optimized in terms of speech quality. Previous methods typically focus on the discriminative supervised learning or generative modeling, which tends to introduce…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-31 Nan Xu, Zhaolong Huang, Xiaonan Zhi

Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, the adoption of Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), has gained traction due to their ability…

Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional…

Sound · Computer Science 2024-12-05 Jiaxuan Liu, Zhaoci Liu, Yajun Hu, Yingying Gao, Shilei Zhang, Zhenhua Ling

Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-18 Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho

While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation, especially translation tasks, remains a non-trivial problem. Specifically,…

Computation and Language · Computer Science 2023-10-27 Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu

Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or…

Sound · Computer Science 2025-01-20 Shengkui Zhao, Zexu Pan, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS, FastSpeech) have shown advantages in training and inference efficiency over RNN-based models (e.g.,…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-04 Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani

Diffusion-based Generative AI gains significant attention for its superior performance over other generative techniques like Generative Adversarial Networks and Variational Autoencoders. While it has achieved notable advancements in fields…

Sound · Computer Science 2024-12-12 Haowei Lou, Helen Paik, Pari Delir Haghighi, Wen Hu, Lina Yao