Related papers: Audio Generation with Multiple Conditional Diffusi…

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although generating high-quality images, such models are mainly conditioned on textual descriptions. This begs the…

Sound · Computer Science 2023-05-23 Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz

AudioGen: Textually Guided Audio Generation

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates…

Sound · Computer Science 2023-03-07 Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

Music ControlNet: Multiple Time-varying Controls for Music Generation

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less…

Sound · Computer Science 2023-11-14 Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan

Controllable Music Production with Diffusion Models and Guidance Gradients

We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include…

Sound · Computer Science 2023-12-06 Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, Tom Nickson

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem

Read, Watch and Scream! Sound Generation from Text and Video

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and limits the flexibility to prioritize sound synthesis for specific objects within the scene. Conversely,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale…

Sound · Computer Science 2026-04-21 Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Enhance audio generation controllability through representation similarity regularization

This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the…

Sound · Computer Science 2023-09-19 Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

Audio Conditioning for Music Generation via Discrete Bottleneck Features

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language model based music generation system with audio input. Our exploration involves two distinct…

Sound · Computer Science 2024-07-31 Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach…

Sound · Computer Science 2024-12-16 Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto

Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook

Existing text-to-speech systems predominantly focus on single-sentence synthesis and lack adequate contextual modeling as well as fine-grained performance control capabilities for generating coherent multicast audiobooks. To address these…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-23 Min Liu, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, Jianhao Ye

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this…

Sound · Computer Science 2024-07-23 Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

ArchiSound: Audio Generation with Diffusion

The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation. One area that has yet to be fully explored is the application of…

Sound · Computer Science 2023-02-01 Flavio Schneider

Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information

The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive…

Multimedia · Computer Science 2025-10-08 Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, Danilo Comminiello

Synthetic training set generation using text-to-audio models for environmental sound classification

In recent years, text-to-audio models have revolutionized the field of automatic audio generation. This paper investigates their application in generating synthetic datasets for training data-driven models. Specifically, this study analyzes…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-09 Francesca Ronchini, Luca Comanducci, Fabio Antonacci

Controllable Video-to-Music Generation with Multiple Time-Varying Conditions

Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box…

Multimedia · Computer Science 2025-07-29 Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound…

Sound · Computer Science 2023-05-01 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio…

Sound · Computer Science 2023-01-31 Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

Text-Driven Foley Sound Generation With Latent Diffusion Model

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a…

Sound · Computer Science 2023-09-19 Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang

AudioX: A Unified Framework for Anything-to-Audio Generation

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. As such, we…

Multimedia · Computer Science 2026-04-16 Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo