Related papers: DiffATR: Diffusion-based Generative Modeling for A…

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Diffusion Models for Audio Restoration

With the development of audio playback devices and fast data transmission, the demand for high sound quality is rising for both entertainment and communications. In this quest for better sound quality, challenges emerge from distortions and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-12 Jean-Marie Lemercier, Julius Richter, Simon Welker, Eloi Moliner, Vesa Välimäki, Timo Gerkmann

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-09 Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration

Diffusion-based generative models have had a high impact on the computer vision and speech processing communities these past years. Besides data generation tasks, they have also been employed for data restoration tasks like speech…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-17 Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

From Noise to Order: Learning to Rank via Denoising Diffusion

In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature…

Information Retrieval · Computer Science 2026-02-13 Sajad Ebrahimi, Bhaskar Mitra, Negar Arabzadeh, Ye Yuan, Haolun Wu, Fattane Zarrinkalam, Ebrahim Bagheri

GD-Retriever: Controllable Generative Text-Music Retrieval with Diffusion Models

Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, but improving joint embedding spaces remains an active research area. Less attention has been given to making these systems…

Sound · Computer Science 2025-06-25 Julien Guinot, Elio Quinton, György Fazekas

AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE)…

Sound · Computer Science 2025-05-29 Junqi Zhao, Jinzheng Zhao, Haohe Liu, Yun Chen, Lu Han, Xubo Liu, Mark Plumbley, Wenwu Wang

DiffPhase: Generative Diffusion-based STFT Phase Retrieval

Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis. As a generative approach, diffusion models have been shown to be especially suitable for imputation problems, where…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-05 Tal Peer, Simon Welker, Timo Gerkmann

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-23 Yudong Yang, Zhan Liu, Wenyi Yu, Guangzhi Sun, Qiuqiang Kong, Chao Zhang

ArchiSound: Audio Generation with Diffusion

The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation. One area that has yet to be fully explored is the application of…

Sound · Computer Science 2023-02-01 Flavio Schneider

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high-latency issues. To address this bottleneck, we propose a…

Sound · Computer Science 2026-05-04 Kuan-Po Huang, Bo-Ru Lu, Byeonggeun Kim, Mihee Lee, Zalan Fabian, Renard Korzeniowski, Qingming Tang, Greg Ver Steeg, Hung-yi Lee, Chieh-Chi Kao, Chao Wang

Table-to-Text Generation with Pretrained Diffusion Models

Diffusion models have demonstrated significant potential in achieving state-of-the-art performance across various text generation tasks. In this systematic study, we investigate their application to the table-to-text problem by adapting the…

Computation and Language · Computer Science 2024-09-24 Aleksei S. Krylov, Oleg D. Somov

Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach,…

Sound · Computer Science 2025-11-26 Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification

Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are…

Sound · Computer Science 2025-01-10 Qing Wang, Jixun Yao, Zhaokai Sun, Pengcheng Guo, Lei Xie, John H. L. Hansen

Discrete Diffusion Models for Language Generation

Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that…

Computation and Language · Computer Science 2025-07-10 Ashen Weligalle

Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. While existing reviews provide overviews, there remains limited in-depth…

Sound · Computer Science 2026-01-16 Ge Zhu, Yutong Wen, Zhiyao Duan

DiffuGR: Generative Document Retrieval with Diffusion Language Models

Generative retrieval (GR) reframes document retrieval as an end-to-end task of generating sequential document identifiers (DocIDs). Existing GR methods predominantly rely on left-to-right auto-regressive decoding, which suffers from two…

Information Retrieval · Computer Science 2026-02-04 Xinpeng Zhao, Zhaochun Ren, Yukun Zhao, Zhenyang Li, Mengqi Zhang, Jun Feng, Ran Chen, Ying Zhou, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Xin Xin

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample…

Sound · Computer Science 2026-04-30 Bo Cheng, Songjun Cao, Xiaoming Zhang, Jie Chen, Long Ma, Fei Chen

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing…

Machine Learning · Computer Science 2021-08-06 Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a…

Sound · Computer Science 2025-05-06 Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou