Related papers: dMel: Speech Tokenization made Simple

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain. To this end, various compression and…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-21 Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains…

Computation and Language · Computer Science 2025-09-30 Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali

Recent Advances in Discrete Speech Tokens: A Review

The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-15 Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Discrete Audio Tokens: More Than a Survey!

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse…

Sound · Computer Science 2025-09-30 Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

RepCodec: A Speech Representation Codec for Speech Tokenization

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-23 Zhichao Huang, Chutong Meng, Tom Ko

TokSing: Singing Voice Synthesis based on Discrete Tokens

Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate…

Sound · Computer Science 2024-06-21 Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic…

Sound · Computer Science 2026-05-28 Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

Towards Audio Token Compression in Large Audio Language Models

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-27 Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works…

Computation and Language · Computer Science 2025-10-30 Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng

DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-30 Yang Yang, Yunpeng Li, George Sung, Shao-Fu Shih, Craig Dooley, Alessio Centazzo, Ramanan Rajeswaran

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate…

Computation and Language · Computer Science 2025-06-23 Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

MelTok: 2D Tokenization for Single-Codebook Audio Compression

Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi-layer residual vector quantizers to a…

Sound · Computer Science 2025-12-05 Jingyi Li, Zhiyuan Zhao, Zhisheng Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued…

Sound · Computer Science 2023-05-30 Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational…

Audio and Speech Processing · Electrical Eng. & Systems 2025-03-31 Shivam Mehta, Nebojsa Jojic, Hannes Gamper

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Whisper-GPT: A Hybrid Representation Audio Large Language Model

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge…

Sound · Computer Science 2024-12-20 Prateek Verma

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens

This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-25 Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić

Scaling Spoken Language Models with Syllabic Speech Tokenization

Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with…

Computation and Language · Computer Science 2026-02-05 Nicholas Lee, Cheol Jun Cho, Alan W Black, Gopala K. Anumanchipalli

AudioLM: a Language Modeling Approach to Audio Generation

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation…

Sound · Computer Science 2023-07-27 Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour