Related papers: Efficient Parallel Audio Generation using Group Ma…

SoundStorm: Efficient Parallel Audio Generation

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the…

Sound · Computer Science 2023-05-17 Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

Autoregressive language models are the currently dominant paradigm for text generation, but they have some fundamental limitations that cannot be remedied by scale, for example inherently sequential and unidirectional generation. While…

Computation and Language · Computer Science 2024-08-01 Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore Papineni, Sanjiv Kumar, Andrej Risteski

SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks

Speech representations learned from self-supervised learning (SSL) models can benefit various speech processing tasks. However, utilizing SSL representations usually requires fine-tuning the pre-trained models or designing task-specific…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-12 Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-yi Lee

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response…

Computation and Language · Computer Science 2024-10-04 Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio…

Sound · Computer Science 2023-01-31 Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

Generative Spoken Language Model based on continuous word-sized audio tokens

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms or 40ms-long discrete units (shorter than a…

Computation and Language · Computer Science 2023-10-10 Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling

Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Kuan-Po Huang, Shu-wen Yang, Huy Phan, Bo-Ru Lu, Byeonggeun Kim, Sashank Macha, Qingming Tang, Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang

Whisper-GPT: A Hybrid Representation Audio Large Language Model

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge…

Sound · Computer Science 2024-12-20 Prateek Verma

On the Reasoning Abilities of Masked Diffusion Language Models

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their…

Machine Learning · Computer Science 2026-04-28 Anej Svete, Ashish Sabharwal

Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is…

Computation and Language · Computer Science 2026-05-12 Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

PromptSep: Generative Audio Separation via Multimodal Prompting

Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their…

Sound · Computer Science 2025-11-07 Yutong Wen, Ke Chen, Prem Seetharaman, Oriol Nieto, Jiaqi Su, Rithesh Kumar, Minje Kim, Paris Smaragdis, Zeyu Jin, Justin Salamon

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang

AudioLM: a Language Modeling Approach to Audio Generation

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation…

Sound · Computer Science 2023-07-27 Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-10 Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative…

Computation and Language · Computer Science 2024-11-04 Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

Accelerating Inference of Masked Image Generators via Reinforcement Learning

Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Pranav Subbaraman, Shufan Li, Siyan Zhao, Aditya Grover

Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking

We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices,…

Computation and Language · Computer Science 2023-05-31 Iñigo Urteaga, Moulay-Zaïdane Draïdia, Tomer Lancewicki, Shahram Khadivi

Auto-Regressive Masked Diffusion Models

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the…

Machine Learning · Computer Science 2026-01-26 Mahdi Karami, Ali Ghodsi

Long-Form Speech Generation with Spoken Language Models

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past…

Computation and Language · Computer Science 2025-07-11 Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan