Related papers: Visually-Guided Sound Source Separation with Audio…

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Ruohan Gao, Kristen Grauman

High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

The objective of this paper is to perform audio-visual sound source separation, i.e. to separate component audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint the source location in the input video…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Lingyu Zhu, Esa Rahtu

Audio-Visual Cross-Modal Compression for Generative Face Video Coding

Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established…

Image and Video Processing · Electrical Eng. & Systems 2025-12-18 Youmin Xu, Mengxi Guo, Shijie Zhao, Weiqi Li, Junlin Li, Li Zhang, Jian Zhang

High-Quality Visually-Guided Sound Separation from Diverse Categories

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science 2024-10-14 Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation.…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-18 Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

Audio-Visual Grouping Network for Sound Localization from Mixtures

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shentong Mo, Yapeng Tian

Leveraging Category Information for Single-Frame Visual Sound Source Separation

Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but with the expense of large multi-stage architectures and…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Lingyu Zhu, Esa Rahtu

AVFSNet: Audio-Visual Speech Separation for Flexible Number of Speakers with Multi-Scale and Multi-Task Learning

Separating target speech from mixed signals containing flexible speaker quantities presents a challenging task. While existing methods demonstrate strong separation performance and noise robustness, they predominantly assume prior knowledge…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-18 Daning Zhang, Ying Wei

From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Jia Li, Yapeng Tian

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality…

Sound · Computer Science 2024-05-07 Zhaoxi Mu, Xinyu Yang

Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-12 Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-16 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Audio-Visual Separation with Hierarchical Fusion and Representation Alignment

Self-supervised audio-visual source separation leverages natural correlations between audio and vision modalities to separate mixed audio signals. In this work, we first systematically analyse the performance of existing multimodal fusion…

Multimedia · Computer Science 2025-10-10 Han Hu, Dongheng Lin, Qiming Huang, Yuqi Hou, Hyung Jin Chang, Jianbo Jiao

Visually Guided Sound Source Separation using Cascaded Opponent Filter Network

The objective of this paper is to recover the original component signals from a mixture audio with the aid of visual cues of the sound sources. Such task is usually referred as visually guided sound source separation. The proposed Cascaded…

Computer Vision and Pattern Recognition · Computer Science 2020-07-15 Lingyu Zhu, Esa Rahtu

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers. Although audio-only approaches achieve satisfactory performance, they build on a strategy to handle the predefined conditions,…

Sound · Computer Science 2020-12-01 Peng Zhang, Jiaming Xu, Jing shi, Yunzhe Hao, Bo Xu

Visual Scene Graphs for Audio Source Separation

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid…

Computer Vision and Pattern Recognition · Computer Science 2021-09-27 Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman