Related papers: Can Textual Semantics Mitigate Sounding Object Seg…

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior…

Sound · Computer Science · 2023-08-02 · Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-16 · Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-03 · Yujian Lee, Peng Gao, Yongqi Xu, Wentao Fan

Audio Visual Segmentation Through Text Embeddings

The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-30 · Kyungbok Lee, You Zhang, Zhiyao Duan

Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer…

Sound · Computer Science · 2025-02-24 · Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

How to effectively interact audio with vision has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual segmentation (AVS) task has been proposed, aiming to segment the sounding objects in…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-07 · Tianxiang Chen, Zhentao Tan, Tao Gong, Qi Chu, Yue Wu, Bin Liu, Le Lu, Jieping Ye, Nenghai Yu

Audio-Visual Segmentation via Unlabeled Frame Exploitation

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-19 · Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-15 · Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, Gustavo Carneiro

Improving Audio-Visual Segmentation with Bidirectional Generation

The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-20 · Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-07 · Jia Li, Yapeng Tian

Audio-Visual Segmentation with Semantics

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first…

Computer Vision and Pattern Recognition · Computer Science · 2023-01-31 · Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

3D Audio-Visual Segmentation

Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-22 · Artem Sokolov, Swapnil Bhosale, Xiatian Zhu

Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to…

Multimedia · Computer Science · 2026-03-24 · Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou

Open-Vocabulary Audio-Visual Semantic Segmentation

Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data,…

Multimedia · Computer Science · 2024-08-01 · Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying

Audio-Visual Segmentation

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-20 · Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image.…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-22 · Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, Xin Yu

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes.…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-13 · Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu

OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion,…

Machine Learning · Computer Science · 2026-03-31 · Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann

Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion. This limits their…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-14 · Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Xiatian Zhu

Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound producing objects in videos, by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-11 · Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu