Related papers: Audio-Visual Floorplan Reconstruction

Echo-Reconstruction: Audio-Augmented 3D Scene Reconstruction

Reflective and textureless surfaces such as windows, mirrors, and walls can be a challenge for object and scene reconstruction. These surfaces are often poorly reconstructed and filled with depth discontinuities and holes, making it…

Computer Vision and Pattern Recognition · Computer Science 2021-10-07 Justin Wilson, Nicholas Rewkowski, Ming C. Lin, Henry Fuchs

VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes

Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural…

Computer Vision and Pattern Recognition · Computer Science 2024-11-08 Advaith V. Sethuraman, Onur Bagoren, Harikrishnan Seetharaman, Dalton Richardson, Joseph Taylor, Katherine A. Skinner

MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling

Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a…

Computer Vision and Pattern Recognition · Computer Science 2024-05-24 Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan

Learning to Set Waypoints for Audio-Visual Navigation

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-02-12 Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman

Real-time 3-D Mapping with Estimating Acoustic Materials

This paper proposes a real-time system integrating an acoustic material estimation from visual appearance and an on-the-fly mapping in the 3-dimension. The proposed method estimates the acoustic materials of surroundings in indoor scenes…

Robotics · Computer Science 2019-09-17 Taeyoung Kim, Youngsun Kwon, Sung-eui Yoon

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

Vision Language Models Can Parse Floor Plan Maps

Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM…

Robotics · Computer Science 2025-11-26 David DeFazio, Hrudayangam Mehta, Meng Wang, Ping Yang, Jeremy Blackburn, Shiqi Zhang

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Jia Li, Yapeng Tian

Audio Visual Language Maps for Robot Navigation

While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a…

Robotics · Computer Science 2023-03-28 Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

SoundSpaces: Audio-Visual Navigation in 3D Environments

Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

Floorplan-Aware Camera Poses Refinement

Processing large indoor scenes is a challenging task, as scan registration and camera trajectory estimation methods accumulate errors across time. As a result, the quality of reconstructed scans is insufficient for some applications, such…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Anna Sokolova, Filipp Nikitin, Anna Vorontsova, Anton Konushin

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify…

Sound · Computer Science 2022-07-22 Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

Image-Plane Geometric Decoding for View-Invariant Indoor Scene Reconstruction

Volume-based indoor scene reconstruction methods offer superior generalization capability and real-time deployment potential. However, existing methods rely on multi-view pixel back-projection ray intersections as weak geometric constraints…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Mingyang Li, Yimeng Fan, Changsong Liu, Lixue Xu, Xin Wang, Yanyan Liu, Wei Zhang

Learning Audio-Visual Dereverberation

Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to…

Sound · Computer Science 2023-03-15 Changan Chen, Wei Sun, David Harwath, Kristen Grauman

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the…

Computer Vision and Pattern Recognition · Computer Science 2020-06-15 Karren Yang, Bryan Russell, Justin Salamon

Can a Robot Hear the Shape and Dimensions of a Room?

Knowing the geometry of a space is desirable for many applications, e.g. sound source localization, sound field reproduction or auralization. In circumstances where only acoustic signals can be obtained, estimating the geometry of a room is…

Sound · Computer Science 2019-07-03 Linh Nguyen, Jaime Valls Miro, Xiaojun Qiu

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

Multi-Channel Replay Speech Detection using Acoustic Maps

Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-21 Michael Neri, Tuomas Virtanen

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Derong Jin, Ruohan Gao