Related papers: Diff-SAGe: End-to-End Spatial Audio Generation Usi…

ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate…

Sound · Computer Science 2025-02-11 Mojtaba Heydari, Mehrez Souden, Bruno Conejo, Joshua Atkins

DiffAU: Diffusion-Based Ambisonics Upscaling

Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-31 Amit Milstein, Nir Shlezinger, Boaz Rafaely

DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics…

Sound · Computer Science 2026-04-13 Ziyu Luo, Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen

DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics…

Sound · Computer Science 2026-05-05 Ziyu Luo, Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen

ViSAGe: Video-to-Spatial Audio Generation

Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating…

Sound · Computer Science 2025-06-17 Jaeyeon Kim, Heeseung Yun, Gunhee Kim

Generating Moving 3D Soundscapes with Latent Diffusion Models

Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in…

Sound · Computer Science 2025-09-22 Christian Templin, Yanda Zhu, Hao Wang

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo…

Sound · Computer Science 2025-02-26 Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

Ambisonics Super-Resolution Using A Waveform-Domain Neural Network

Ambisonics is a spatial audio format describing a sound field. First-order Ambisonics (FOA) is a popular format comprising only four channels. This limited channel count comes at the expense of spatial accuracy. Ideally one would be able to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-04 Ismael Nawfal, Symeon Delikaris Manias, Mehrez Souden, Juha Merimaa, Joshua Atkins, Elisabeth McMullin, Shadi Pirhosseinloo, Daniel Phillips

ImmersiveFlow: Stereo-to-7.1.4 spatial audio generation with flow matching

Immersive spatial audio has become increasingly critical for applications ranging from AR/VR to home entertainment and automotive sound systems. However, existing generative methods remain constrained to low-dimensional formats such as…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-21 Zining Liang, Runbang Wang, Xuzhou Ye, Qiuqiang Kong

Towards Spatial Audio Understanding via Question Answering

In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD)…

Sound · Computer Science 2025-07-15 Parthasaarathy Sudarsanam, Archontis Politis

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then…

Computer Vision and Pattern Recognition · Computer Science 2025-12-03 Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin

OmniAudio: Generating Spatial Audio from 360-Degree Video

Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-04 Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

SIRUP: A diffusion-based virtual upmixer of steering vectors for highly-directive spatialization with first-order ambisonics

This paper presents virtual upmixing of steering vectors captured by a fewer-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-23 Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In…

Sound · Computer Science 2023-11-21 Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan

FoleySpace: Vision-Aligned Binaural Spatial Audio Generation

Recently, with the advancement of AIGC, deep learning-based video-to-audio (V2A) technology has garnered significant attention. However, existing research mostly focuses on mono audio generation that lacks spatial perception, while the…

Sound · Computer Science 2025-08-22 Lei Zhao, Rujin Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-30 Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in applications in various domains including audio. While existing reviews provide overviews, there remains limited in-depth…

Sound · Computer Science 2026-01-16 Ge Zhu, Yutong Wen, Zhiyao Duan

Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in…

Sound · Computer Science 2025-07-03 Ysobel Sims, Alexandre Mendes, Stephan Chalup

SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

Generating combined visual and auditory sensory experiences is critical for the consumption of immersive content. Recent advances in neural generative models have enabled the creation of high-resolution content across multiple modalities…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem