Weakly-Supervised Audio-Visual Segmentation

Computer Vision and Pattern Recognition 2023-11-28 v1 Artificial Intelligence Machine Learning Multimedia Sound Audio and Speech Processing

Abstract

Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work relied on a comprehensive, manually designed architecture supervised with a large number of pixel-wise accurate masks. However, these pixel-level masks are expensive to annotate and are not available in all cases. In this work, we aim to simplify the supervision to instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that learns multi-scale audio-visual alignment through multi-scale multiple-instance contrastive learning. Extensive experiments on AVSBench demonstrate the effectiveness of WS-AVS for weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
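
As a rough illustration of what multi-scale multiple-instance contrastive learning can look like, the NumPy sketch below scores each video frame as a "bag" of spatial locations against a clip-level audio embedding and averages an InfoNCE-style loss over several feature-map resolutions. The abstract does not specify the loss, so everything here is an assumption made for illustration: the max-pooling over locations, the temperature value, and the function names `mil_contrastive_loss` and `multi_scale_loss` are hypothetical and are not taken from WS-AVS.

```python
import numpy as np


def mil_contrastive_loss(audio, visual, temperature=0.07, eps=1e-8):
    """Audio-to-visual contrastive loss at a single scale (illustrative only).

    audio:  (B, D) array, one embedding per clip.
    visual: (B, D, H, W) array, one feature map per frame.
    """
    b, d, h, w = visual.shape
    # L2-normalise so that dot products are cosine similarities.
    a = audio / (np.linalg.norm(audio, axis=1, keepdims=True) + eps)
    v = visual.reshape(b, d, h * w)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    # sim[i, j, p]: similarity between audio i and location p of frame j.
    sim = np.einsum("id,jdp->ijp", a, v)
    # Multiple-instance step (assumed): a frame is scored by its best location.
    logits = sim.max(axis=2) / temperature  # shape (B, B)
    # Cross-entropy over frames, with matching pairs on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def multi_scale_loss(audio, visual_pyramid, temperature=0.07):
    """Average the single-scale loss over feature maps of different sizes."""
    losses = [mil_contrastive_loss(audio, v, temperature) for v in visual_pyramid]
    return float(np.mean(losses))


# Toy usage: 4 clips, 8-dim embeddings, feature maps at 3 resolutions.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
pyramid = [rng.normal(size=(4, 8, s, s)) for s in (28, 14, 7)]
loss = multi_scale_loss(audio, pyramid)
```

Because only clip-level pairing is used as the training signal in this sketch, no pixel masks enter the loss; the per-location similarity map `sim` is the quantity one would threshold to obtain a coarse localization.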

Cite

@article{arxiv.2311.15080,
  title  = {Weakly-Supervised Audio-Visual Segmentation},
  author = {Shentong Mo and Bhiksha Raj},
  journal= {arXiv preprint arXiv:2311.15080},
  year   = {2023}
}