Related papers: Autoregressive Speech Synthesis without Vector Qua…

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language…

Computation and Language · Computer Science 2025-09-04 Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin

Bayesian Speech Synthesizers Can Learn from Multiple Teachers

Text-to-Speech (TTS) is inherently a "one-to-many" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have…

Sound · Computer Science 2026-02-11 Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiang Li, Wen Wu, Chao Zhang

MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis

This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-27 Keyu An, Zhiyu Zhang, Changfeng Gao, Yabin Li, Zhendong Peng, Haoxu Wang, Zhihao Du, Han Zhao, Zhifu Gao, Xiangang Li

KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-18 Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, Lei Xie

Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis

We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional…

Machine Learning · Computer Science 2025-02-14 Weiwei Lin, Chenghan He

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the…

Sound · Computer Science 2025-09-29 Junjie Cao, Yichen Han, Ruonan Zhang, Xiaoyang Hao, Hongxiang Li, Shuaijiang Zhao, Yue Liu, Xiao-Ping Zhng

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-06 Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the…

Computation and Language · Computer Science 2019-11-21 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS…

Computation and Language · Computer Science 2023-01-06 Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling

Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces…

Machine Learning · Computer Science 2025-06-03 Haiyang Sun, Shujie Hu, Shujie Liu, Lingwei Meng, Hui Wang, Bing Han, Yifan Yang, Yanqing Liu, Sheng Zhao, Yan Lu, Yanmin Qian

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-04-04 Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis

Autoregressive (AR) language models have emerged as powerful solutions for zero-shot text-to-speech (TTS) synthesis, capable of generating natural speech from a few seconds of audio prompts. However, conventional AR-based TTS systems…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-27 Chun Yat Wu, Jiajun Deng, Guinan Li, Qiuqiang Kong, Simon Lui

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. The autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their…

Sound · Computer Science 2021-07-08 Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-17 Siyang Wang, Éva Székely

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech…

Sound · Computer Science 2023-12-19 Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently…

Computation and Language · Computer Science 2019-01-04 Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring…

Computation and Language · Computer Science 2024-06-13 Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei

Continuous Autoregressive Language Models

The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic…

Computation and Language · Computer Science 2025-11-03 Chenze Shao, Darren Li, Fandong Meng, Jie Zhou

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-11 Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu