Related papers

Related papers: Audio Generation with Multiple Conditional Diffusi…

200 papers

In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although generating high-quality images, such models are mainly conditioned on textual descriptions. This begs the…

Sound · Computer Science 2023-05-23 Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates…

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less…

Sound · Computer Science 2023-11-14 Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan

We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include…

Sound · Computer Science 2023-12-06 Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, Tom Nickson

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers limited flexibility to prioritize sound synthesis for specific objects within the scene. Conversely,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale…

Sound · Computer Science 2026-04-21 Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the…

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language model based music generation system with audio input. Our exploration involves two distinct…

Sound · Computer Science 2024-07-31 Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez

The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach…

Sound · Computer Science 2024-12-16 Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto

Existing text-to-speech systems predominantly focus on single-sentence synthesis and lack adequate contextual modeling as well as fine-grained performance control capabilities for generating coherent multicast audiobooks. To address these…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-23 Min Liu, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, Jianhao Ye

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this…

Sound · Computer Science 2024-07-23 Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation. One area that has yet to be fully explored is the application of…

Sound · Computer Science 2023-02-01 Flavio Schneider

The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive…

In recent years, text-to-audio models have revolutionized the field of automatic audio generation. This paper investigates their application in generating synthetic datasets for training data-driven models. Specifically, this study analyzes…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-09 Francesca Ronchini, Luca Comanducci, Fabio Antonacci

Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box…

Multimedia · Computer Science 2025-07-29 Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun

Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound…

Sound · Computer Science 2023-05-01 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio…

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a…

Sound · Computer Science 2023-09-19 Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. As such, we…

Multimedia · Computer Science 2026-04-16 Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo