Related papers: Cinematic Audio Source Separation Using Visual Cue…

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Cinematic audio source separation (CASS), as a standalone problem of extracting individual stems from their mixture, is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-27 Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

We propose a knowledge-driven approach to speech target extraction in the presence of background sound effects already recorded in cinematic audio. The specific knowledge sources studied are manners of articulation that are detected in…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-01 Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Cinematic audio source separation (CASS), as a problem of extracting the dialogue, music, and effects stems from their mixture, is a relatively new subtask of audio source separation. To date, only one publicly available dataset exists for…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-27 Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

Separate Anything You Describe

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-03 Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

GASS: Generalizing Audio Source Separation with Large-scale Data

Universal source separation aims to separate the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most…

Sound · Computer Science 2023-10-03 Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà

DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds

We propose a new dataset for cinematic audio source separation (CASS) that handles non-verbal sounds. Existing CASS datasets only contain reading-style sounds as a speech stem. These datasets differ from actual movie audio, which is more…

Sound · Computer Science 2025-06-10 Takuya Hasumi, Yusuke Fujita

Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator

Recent audio-visual generative models have made substantial progress in generating images from audio. However, existing approaches focus on generating images from single-class audio and fail to generate images from mixed audio. To address…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Minjae Kang, Martim Brandão

Self-Supervised Visual Acoustic Matching

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target…

Multimedia · Computer Science 2023-11-27 Arjun Somayazulu, Changan Chen, Kristen Grauman

High-Quality Visually-Guided Sound Separation from Diverse Categories

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2024-10-14 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Leveraging Category Information for Single-Frame Visual Sound Source Separation

Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but with the expense of large multi-stage architectures and…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Lingyu Zhu, Esa Rahtu

MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment

Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works only focus on…

Sound · Computer Science 2025-12-11 Hao Zhou, Xiaobao Guo, Yuzhe Zhu, Adams Wai-Kin Kong

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Ruohan Gao, Kristen Grauman

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image.…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, Xin Yu

Visual Scene Graphs for Audio Source Separation

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid…

Computer Vision and Pattern Recognition · Computer Science 2021-09-27 Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios.…

Sound · Computer Science 2024-07-31 Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu

High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor involved visual…

Sound · Computer Science 2023-06-21 Zengjie Song, Zhaoxiang Zhang

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a…

Computer Vision and Pattern Recognition · Computer Science 2023-09-21 Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

The audio-visual sound separation field assumes visible sources in videos, but this excludes invisible sounds beyond the camera's view. Current methods struggle with such sounds lacking visible cues. This paper introduces a novel…

Computer Vision and Pattern Recognition · Computer Science 2023-10-19 Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj