Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models

Saksham Singh Kushwaha; Jianbo Ma; Mark R. P. Thomas; Yapeng Tian; Avery Bruni

doi:10.1109/ICASSP49660.2025.10888882

Sound 2025-07-16 v1 Audio and Speech Processing

Abstract

Spatial audio is a crucial component in creating immersive experiences. Traditional simulation-based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task of generating first-order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff-SAGe, an end-to-end, flow-based diffusion-transformer model for this task. Diff-SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi-conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we demonstrate that our method consistently outperforms traditional simulation-based baselines across both objective and subjective metrics.
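For readers unfamiliar with the representation named in the abstract, the following is a minimal sketch of what a first-order Ambisonics signal and its complex spectrogram look like. It is not the Diff-SAGe model: it uses the standard closed-form encoding of a mono source at a fixed direction, and it assumes ACN channel order (W, Y, Z, X), SN3D normalization, and azimuth measured counterclockwise from the front, none of which the abstract specifies. The function names `encode_foa` and `complex_spectrogram`, and the STFT parameters, are hypothetical choices for illustration.

```python
import numpy as np


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as 4-channel FOA.

    Assumed convention (not stated in the source): ACN order W, Y, Z, X
    with SN3D normalization; azimuth is counterclockwise from the front.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    return gains[:, None] * np.asarray(mono, dtype=float)[None, :]


def complex_spectrogram(foa, n_fft=512, hop=256):
    """Per-channel STFT kept complex, so inter-channel phase is preserved."""
    window = np.hanning(n_fft)
    n_frames = 1 + (foa.shape[1] - n_fft) // hop
    frames = np.stack(
        [foa[:, i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )  # shape: (4, n_frames, n_fft)
    return np.fft.rfft(frames, axis=-1)  # shape: (4, n_frames, n_fft // 2 + 1)


# One second of a 440 Hz tone placed directly to the left (azimuth 90 degrees).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mono = np.sin(2 * np.pi * 440.0 * t)
foa = encode_foa(mono, azimuth_deg=90.0, elevation_deg=0.0)
spec = complex_spectrogram(foa)
```

Taking only the magnitude of `spec` would discard the sign and phase relationships between W, Y, Z, and X, which are what carry the direction of the source; this is one way to read the abstract's point about preserving phase information.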

Keywords

audio generation

Cite

@article{arxiv.2410.11299,
  title   = {Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models},
  author  = {Saksham Singh Kushwaha and Jianbo Ma and Mark R. P. Thomas and Yapeng Tian and Avery Bruni},
  journal = {arXiv preprint arXiv:2410.11299},
  year    = {2025}
}
