Audio-Visual Segmentation

Jinxing Zhou; Jianyuan Wang; Jiayi Zhang; Weixuan Sun; Jing Zhang; Stan Birchfield; Dan Guo; Lingpeng Kong; Meng Wang; Yiran Zhong

Subjects: Computer Vision and Pattern Recognition; Multimedia; Sound; Audio and Speech Processing; Image and Video Processing
Date: 2023-02-20 (v3)

Abstract

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
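To make the task's input and output concrete, here is a minimal sketch of audio-guided pixel labeling. It is not the temporal pixel-wise audio-visual interaction module proposed in the paper. It swaps in a much simpler stand-in technique: thresholded cosine similarity between one audio embedding and each pixel's visual feature vector. The function names `cosine` and `audio_guided_mask`, the `threshold` parameter, and the toy features are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def audio_guided_mask(visual_feats, audio_feat, threshold=0.5):
    """Return an H x W binary mask for one frame.

    visual_feats: H x W grid (nested lists) of C-dimensional feature vectors.
    audio_feat:   one C-dimensional audio embedding for the same time step.
    A pixel is marked 1 when its feature is similar enough to the audio
    embedding. This is an illustrative assumption, not the paper's module.
    """
    return [
        [1 if cosine(pixel, audio_feat) > threshold else 0 for pixel in row]
        for row in visual_feats
    ]


# Toy 2 x 2 frame with 2-dimensional features (made-up values):
# the top row resembles the audio embedding, the bottom row does not.
visual = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
audio = [1.0, 0.0]
mask = audio_guided_mask(visual, audio)
print(mask)
```

In this toy frame only the top-row pixels are marked as sounding. A learned model would replace the fixed similarity rule and threshold with trained fusion and a segmentation decoder.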

Keywords

audio-visual speech recognition; video segmentation; semantic segmentation

Cite

@article{arxiv.2207.05042,
  title  = {Audio-Visual Segmentation},
  author = {Jinxing Zhou and Jianyuan Wang and Jiayi Zhang and Weixuan Sun and Jing Zhang and Stan Birchfield and Dan Guo and Lingpeng Kong and Meng Wang and Yiran Zhong},
  journal = {arXiv preprint arXiv:2207.05042},
  year   = {2023}
}

Comments

ECCV 2022; Code is available at https://github.com/OpenNLPLab/AVSBench
