Related papers: TS3-Codec: Transformer-Based Simple Streaming Sing…

SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain

Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-27 Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei

UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction

Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction, which also aim to bridge the gap between continuous and discrete signals. Existing NACs can be divided into two categories:…

Sound · Computer Science 2026-01-07 Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Shengbo Cai, Guoyang Zeng, Zhiyong Wu

SUNAC: Source-aware Unified Neural Audio Codec

Neural audio codecs (NACs) provide compact representations that can be leveraged in many downstream applications, in particular large language models. Yet most NACs encode mixtures of multiple sources in an entangled manner, which may…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-21 Ryo Aihara, Yoshiki Masuyama, Francesco Paissan, François G. Germain, Gordon Wichern, Jonathan Le Roux

L3AC: Towards a Lightweight and Lossless Audio Codec

Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and provide discrete tokens for generative modeling. However, leading approaches often rely on resource-intensive models and complex…

Sound · Computer Science 2025-08-18 Linwei Zhai, Han Ding, Cui Zhao, fei wang, Ge Wang, Wang Zhi, Wei Xi

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without…

Sound · Computer Science 2024-06-17 Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-16 Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-14 Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen

Analysis of Speaker Verification Performance Trade-offs with Neural Audio Codec Transmission

Neural audio codecs (NACs) have made significant advancements in recent years and are rapidly being adopted in many audio processing pipelines. However, they can introduce audio distortions which degrade speaker verification (SV)…

Sound · Computer Science 2025-09-04 Nirmalya Mallick Thakur, Jia Qi Yip, Eng Siong Chng

Towards Audio Codec-based Speech Separation

Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have…

Sound · Computer Science 2024-07-08 Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-29 Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, Alexander Richard

Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations

Neural audio codecs (NACs), which use neural networks to generate compact audio representations, have garnered interest for their applicability to many downstream tasks -- especially quantized codecs due to their compatibility with large…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-13 Ryo Aihara, Yoshiki Masuyama, Gordon Wichern, François G. Germain, Jonathan Le Roux

Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates

Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio…

Sound · Computer Science 2025-09-15 Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao

TQCodec: Towards neural audio codec for high-fidelity music streaming

We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (≤ 16 kbps), TQCodec operates at 44.1 kHz and supports bitrates from…

Sound · Computer Science 2026-03-03 Lixing He, Zhouxuan Chen, Mingshuai Liu, Xinran Sun, Wucheng Wang, Minfu Li, Lingcheng Kong, Weifeng Zhao, Wenjiang Zhou

T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS

Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi providing high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder,…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-29 Haibin Wu, Bach Viet Do, Naveen Suda, Julian Chan, Madhavan C R, Gene-Ping Yang, Yi-Chiao Wu, Naoyuki Kanda, Yossef Adi, Xin Lei, Yue Liu, Florian Metze, Yuzong Liu

Audio Transformers

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be…

Sound · Computer Science 2025-05-13 Prateek Verma, Jonathan Berger

ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer…

Sound · Computer Science 2022-09-30 Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

SNAC: Multi-Scale Neural Audio Codec

Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual…

Sound · Computer Science 2024-10-21 Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer

Modeling strategies for speech enhancement in the latent space of a neural audio codec

Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as…

Sound · Computer Science 2026-03-12 Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive

ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

Neural speech codecs aim to compress input signals into minimal bits while maintaining content quality in a low-latency manner. However, existing neural codecs often trade model complexity for reconstruction performance. These codecs…

Sound · Computer Science 2024-10-04 Yuzhe Gu, Enmao Diao

Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While…

Sound · Computer Science 2022-07-07 Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund