Related papers: AudioEditor: A Training-Free Diffusion-Based Audio…

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a…

Sound · Computer Science 2023-04-06 Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study,…

Sound · Computer Science 2023-09-12 Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal…

Sound · Computer Science 2026-01-21 Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, Xuenan Xu

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-30 Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale…

Sound · Computer Science 2026-04-21 Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Guiding Audio Editing with Audio Language Model

Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to…

Sound · Computer Science 2025-09-29 Zitong Lan, Yiduo Hao, Mingmin Zhao

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions.…

Sound · Computer Science 2025-01-15 Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest,…

Sound · Computer Science 2026-04-17 Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is…

Sound · Computer Science 2024-01-03 Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-01 Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Xixin Wu

Prompt-guided Precise Audio Editing with Diffusion Models

Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a…

Sound · Computer Science 2024-06-10 Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

Audio-Guided Visual Editing with Complex Multi-Modal Prompts

Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-08-29 Hyeonyu Kim, Seokhoon Jeong, Seonghee Han, Chanhyuk Choi, Taehwan Kim

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This…

Sound · Computer Science 2024-05-29 Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-23 Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Virtual Consistency for Audio Editing

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency…

Sound · Computer Science 2025-09-23 Matthieu Cervera, Francesco Paissan, Mirco Ravanelli, Cem Subakan

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

Text-to-audio (TTA) generation is a recent popular problem that aims to synthesize general audio given text descriptions. Previous methods utilized latent diffusion models to learn audio embedding in a latent space with text embedding as…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Shentong Mo, Jing Shi, Yapeng Tian

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem

Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance, memory consumption, and time consumption are crucial considerations. A recent diffusion-based TTA approach for…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Yeongtak Oh, Jonghyun Lee, Jooyoung Choi, Dahuin Jung, Uiwon Hwang, Sungroh Yoon

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models…

Sound · Computer Science 2026-04-28 Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli