Related papers: AudioEditor: A Training-Free Diffusion-Based Audio…

200 papers

Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a…

Sound · Computer Science 2023-04-06 Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study,…

Sound · Computer Science 2023-09-12 Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal…

Sound · Computer Science 2026-01-21 Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, Xuenan Xu

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-30 Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale…

Sound · Computer Science 2026-04-21 Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to…

Sound · Computer Science 2025-09-29 Zitong Lan, Yiduo Hao, Mingmin Zhao

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions.…

Sound · Computer Science 2025-01-15 Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai

Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest,…

Sound · Computer Science 2026-04-17 Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is…

Sound · Computer Science 2024-01-03 Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-01 Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Xixin Wu

Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a…

Sound · Computer Science 2024-06-10 Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-08-29 Hyeonyu Kim, Seokhoon Jeong, Seonghee Han, Chanhyuk Choi, Taehwan Kim

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This…

We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-23 Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency…

Sound · Computer Science 2025-09-23 Matthieu Cervera, Francesco Paissan, Mirco Ravanelli, Cem Subakan

Text-to-audio (TTA) generation is a recent popular problem that aims to synthesize general audio given text descriptions. Previous methods utilized latent diffusion models to learn audio embedding in a latent space with text embedding as…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Shentong Mo, Jing Shi, Yapeng Tian

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem

Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance, memory consumption, and time consumption are crucial considerations. A recent diffusion-based TTA approach for…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Yeongtak Oh, Jonghyun Lee, Jooyoung Choi, Dahuin Jung, Uiwon Hwang, Sungroh Yoon

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models…

Sound · Computer Science 2026-04-28 Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli