Multi-scale Multi-instance Visual Sound Localization and Segmentation

Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Sound, Audio and Speech Processing | 2024-09-04 | v1

Abstract

Visual sound localization is a challenging problem: predicting the location of the objects in a video that correspond to the sound source. Previous methods mainly use the audio-visual association between global audio features and single-scale visual features to localize sounding objects in each image. Despite their promising performance, they ignore the multi-scale visual features of the corresponding image and fail to learn discriminative regions that match the ground truth. To address these issues, we propose a novel multi-scale multi-instance visual sound localization framework, M2VSL, that directly learns multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, M2VSL leverages learnable multi-scale visual features to align audio-visual representations at locations across multiple levels of the corresponding image. We also introduce a novel multi-scale multi-instance transformer that dynamically aggregates multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on the VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that M2VSL achieves state-of-the-art performance on sounding object localization and segmentation.
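To make the idea of aligning one audio representation with visual features at several scales more concrete, the following is a minimal sketch in plain Python. It is not the M2VSL implementation: the abstract does not specify the architecture, so the learned multi-scale multi-instance transformer is replaced here by a simple average of per-scale similarity maps, and the multi-instance step is approximated by max pooling over locations. All function names (`similarity_map`, `upsample_nearest`, `fuse_scales`, `mil_score`) and the toy inputs are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def similarity_map(audio, feats):
    """Audio-visual similarity at every location of an H x W grid of feature vectors."""
    return [[cosine(audio, cell) for cell in row] for row in feats]

def upsample_nearest(sim, out_h, out_w):
    """Nearest-neighbour resize of a 2D map to out_h x out_w."""
    h, w = len(sim), len(sim[0])
    return [[sim[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def fuse_scales(audio, pyramid, out_h, out_w):
    """Assumed fusion rule: average the per-scale maps at a common resolution.
    (Stands in for the learned aggregation described in the abstract.)"""
    maps = [upsample_nearest(similarity_map(audio, f), out_h, out_w) for f in pyramid]
    n = len(maps)
    return [[sum(m[i][j] for m in maps) / n for j in range(out_w)]
            for i in range(out_h)]

def mil_score(sim):
    """Multi-instance style pooling: the image-level score is the best-matching location."""
    return max(max(row) for row in sim)

# Toy example: a 2-D audio embedding and two visual scales (1x1 and 2x2).
audio = [1.0, 0.0]
coarse = [[[1.0, 0.0]]]
fine = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]]]

heatmap = fuse_scales(audio, [coarse, fine], 2, 2)
score = mil_score(heatmap)
```

In this toy setup the top-left cell matches the audio at both scales, so it receives the highest fused response, while the remaining cells are supported only by the coarse scale. A learned aggregator would weight the scales dynamically rather than uniformly.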

Cite

@article{arxiv.2409.00486,
  title  = {Multi-scale Multi-instance Visual Sound Localization and Segmentation},
  author = {Shentong Mo and Haofan Wang},
  journal= {arXiv preprint arXiv:2409.00486},
  year   = {2024}
}