Related papers: Single-Codec: Single-Codebook Speech Codec towards…

Fewer-token Neural Speech Codec with Time-invariant Codes

Language model based text-to-speech (TTS) models, like VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. Neural speech codec is a critical component of these models, which can convert…

Sound · Computer Science 2024-03-12 Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-22 Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into…

Sound · Computer Science 2024-10-22 Peiji Yang, Fengping Wang, Yicheng Zhong, Huawei Wei, Zhisheng Wang

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting…

Sound · Computer Science 2025-03-04 Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-04-04 Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair…

Sound · Computer Science 2024-10-30 Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Latent-Domain Predictive Neural Speech Coding

Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind…

Sound · Computer Science 2025-10-16 Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

The long speech sequence has been troubling language models (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It…

Sound · Computer Science 2024-09-04 Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

Spectral Codecs: Improving Non-Autoregressive Speech Synthesis with Spectrogram-Based Audio Codecs

Historically, most speech models in machine-learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternate speech representation for speech…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-05 Ryan Langman, Ante Jukić, Kunal Dhawan, Nithin Rao Koluguri, Jason Li

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs…

Sound · Computer Science 2024-12-02 Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-06 Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they…

Sound · Computer Science 2024-06-07 Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook

The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-28 Yidi Jiang, Qian Chen, Shengpeng Ji, Yu Xi, Wen Wang, Chong Zhang, Xianghu Yue, ShiLiang Zhang, Haizhou Li

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models

In recent years, large language models have achieved significant success in generative tasks related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serve as an…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-05 Shengpeng Ji, Minghui Fang, Jialong Zuo, Ziyue Jiang, Dingdong Wang, Hanting Wang, Hai Huang, Zhou Zhao

RepCodec: A Speech Representation Codec for Speech Tokenization

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-23 Zhichao Huang, Chutong Meng, Tom Ko

SLM-SS: Speech Language Model for Generative Speech Separation

Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals,…

Sound · Computer Science 2026-01-28 Tianhua Li, Chenda Li, Wei Wang, Xin Zhou, Xihui Chen, Jianqing Gao, Yanmin Qian

A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by…

Sound · Computer Science 2022-09-23 Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

MBCodec: Thorough disentangle for high-fidelity audio compression

High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and…

Sound · Computer Science 2025-09-23 Ruonan Zhang, Xiaoyang Hao, Yichen Han, Junjie Cao, Yue Liu, Kai Zhang

FreeCodec: A disentangled neural speech codec with fewer tokens

Neural speech codecs have gained great attention for their outstanding reconstruction with discrete token representations. It is a crucial component in generative tasks such as speech coding and large language models (LLM). However, most…

Sound · Computer Science 2025-07-01 Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-16 Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li