Related papers: dMel: Speech Tokenization made Simple

200 papers

Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain. To this end, various compression and…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-21 Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg

Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains…

The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-15 Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse…

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-23 Zhichao Huang, Chutong Meng, Tom Ko

Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate…

Sound · Computer Science 2024-06-21 Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic…

Sound · Computer Science 2026-05-28 Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-27 Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works…

Computation and Language · Computer Science 2025-10-30 Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng

Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-30 Yang Yang, Yunpeng Li, George Sung, Shao-Fu Shih, Craig Dooley, Alessio Centazzo, Ramanan Rajeswaran

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate…

Computation and Language · Computer Science 2025-06-23 Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi-layer residual vector quantizers to a…

Sound · Computer Science 2025-12-05 Jingyi Li, Zhiyuan Zhao, Zhisheng Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li

Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued…

Sound · Computer Science 2023-05-30 Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational…

Audio and Speech Processing · Electrical Eng. & Systems 2025-03-31 Shivam Mehta, Nebojsa Jojic, Hannes Gamper

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge…

Sound · Computer Science 2024-12-20 Prateek Verma

This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-25 Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić

Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with…

Computation and Language · Computer Science 2026-02-05 Nicholas Lee, Cheol Jun Cho, Alan W Black, Gopala K. Anumanchipalli

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation…
