
Related papers: Single-Codec: Single-Codebook Speech Codec towards…

200 papers

Language model based text-to-speech (TTS) models, like VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. Neural speech codec is a critical component of these models, which can convert…

Sound · Computer Science 2024-03-12 Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou

Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-22 Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into…

Sound · Computer Science 2024-10-22 Peiji Yang, Fengping Wang, Yicheng Zhong, Huawei Wei, Zhisheng Wang

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting…

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-04-04 Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair…

Sound · Computer Science 2024-10-30 Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind…

Sound · Computer Science 2025-10-16 Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu

The long speech sequence has been troubling language models (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It…

Sound · Computer Science 2024-09-04 Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

Historically, most speech models in machine-learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternate speech representation for speech…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-05 Ryan Langman, Ante Jukić, Kunal Dhawan, Nithin Rao Koluguri, Jason Li

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs…

Sound · Computer Science 2024-12-02 Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-06 Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they…

Sound · Computer Science 2024-06-07 Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-28 Yidi Jiang, Qian Chen, Shengpeng Ji, Yu Xi, Wen Wang, Chong Zhang, Xianghu Yue, ShiLiang Zhang, Haizhou Li

In recent years, large language models have achieved significant success in generative tasks related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serve as an…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-05 Shengpeng Ji, Minghui Fang, Jialong Zuo, Ziyue Jiang, Dingdong Wang, Hanting Wang, Hai Huang, Zhou Zhao

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-23 Zhichao Huang, Chutong Meng, Tom Ko

Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals,…

Sound · Computer Science 2026-01-28 Tianhua Li, Chenda Li, Wei Wang, Xin Zhou, Xihui Chen, Jianqing Gao, Yanmin Qian

We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by…

Sound · Computer Science 2022-09-23 Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and…

Sound · Computer Science 2025-09-23 Ruonan Zhang, Xiaoyang Hao, Yichen Han, Junjie Cao, Yue Liu, Kai Zhang

Neural speech codecs have gained great attention for their outstanding reconstruction with discrete token representations. It is a crucial component in generative tasks such as speech coding and large language models (LLM). However, most…

Sound · Computer Science 2025-07-01 Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma

This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-16 Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li