Related papers: Efficient Parallel Audio Generation using Group Ma…

200 papers

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the…

Autoregressive language models are the currently dominant paradigm for text generation, but they have some fundamental limitations that cannot be remedied by scale, for example inherently sequential and unidirectional generation. While…

Computation and Language · Computer Science 2024-08-01 Yuchen Li , Alexandre Kirchmeyer , Aashay Mehta , Yilong Qin , Boris Dadachev , Kishore Papineni , Sanjiv Kumar , Andrej Risteski

Speech representations learned from Self-supervised learning (SSL) models can benefit various speech processing tasks. However, utilizing SSL representations usually requires fine-tuning the pre-trained models or designing task-specific…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-12 Kai-Wei Chang , Wei-Cheng Tseng , Shang-Wen Li , Hung-yi Lee

Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response…

Computation and Language · Computer Science 2024-10-04 Kentaro Mitsui , Koh Mitsuda , Toshiaki Wakatsuki , Yukiya Hono , Kei Sawada

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio…

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a…

Computation and Language · Computer Science 2023-10-10 Robin Algayres , Yossi Adi , Tu Anh Nguyen , Jade Copet , Gabriel Synnaeve , Benoit Sagot , Emmanuel Dupoux

Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Kuan-Po Huang , Shu-wen Yang , Huy Phan , Bo-Ru Lu , Byeonggeun Kim , Sashank Macha , Qingming Tang , Shalini Ghosh , Hung-yi Lee , Chieh-Chi Kao , Chao Wang

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge…

Sound · Computer Science 2024-12-20 Prateek Verma

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their…

Machine Learning · Computer Science 2026-04-28 Anej Svete , Ashish Sabharwal

Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is…

Computation and Language · Computer Science 2026-05-12 Donghang Wu , Haoyang Zhang , Jun Chen , Xiangyu , Zhang , Hexin Liu , Eng Siong Chng , Fei Tian , Xuerui Yang , Xiangyu Zhang , Daxin Jiang , Gang Yu

Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their…

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Chenxu Xiong , Ruibo Fu , Shuchen Shi , Zhengqi Wen , Jianhua Tao , Tao Wang , Chenxing Li , Chunyu Qiang , Yuankun Xie , Xin Qi , Guanjun Li , Zizheng Yang

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation…

Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-10 Huadai Liu , Rongjie Huang , Yang Liu , Hengyuan Cao , Jialei Wang , Xize Cheng , Siqi Zheng , Zhou Zhao

While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative…

Computation and Language · Computer Science 2024-11-04 Yongxin Zhu , Dan Su , Liqiang He , Linli Xu , Dong Yu

Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Pranav Subbaraman , Shufan Li , Siyan Zhao , Aditya Grover

We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices,…

Computation and Language · Computer Science 2023-05-31 Iñigo Urteaga , Moulay-Zaïdane Draïdia , Tomer Lancewicki , Shahram Khadivi

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the…

Machine Learning · Computer Science 2026-01-26 Mahdi Karami , Ali Ghodsi

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past…

Computation and Language · Computer Science 2025-07-11 Se Jin Park , Julian Salazar , Aren Jansen , Keisuke Kinoshita , Yong Man Ro , RJ Skerry-Ryan