Related papers: Visually-Guided Sound Source Separation with Audio…

200 papers

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-07 · Tanzila Rahman, Leonid Sigal

Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-01 · Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-07 · Ruohan Gao, Kristen Grauman

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-29 · Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

The objective of this paper is to perform audio-visual sound source separation, i.e., to separate component audios from a mixture based on the videos of sound sources. Moreover, we aim to pinpoint the source location in the input video…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-20 · Lingyu Zhu, Esa Rahtu

Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established…

Image and Video Processing · Electrical Eng. & Systems · 2025-12-18 · Youmin Xu, Mengxi Guo, Shijie Zhao, Weiqi Li, Junlin Li, Li Zhang, Jian Zhang

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-14 · Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation.…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-06-18 · Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-31 · Shentong Mo, Yapeng Tian

Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but with the expense of large multi-stage architectures and…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-19 · Lingyu Zhu, Esa Rahtu

Separating target speech from mixed signals containing flexible speaker quantities presents a challenging task. While existing methods demonstrate strong separation performance and noise robustness, they predominantly assume prior knowledge…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-07-18 · Daning Zhang, Ying Wei

Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-07 · Jia Li, Yapeng Tian

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality…

Sound · Computer Science · 2024-05-07 · Zhaoxi Mu, Xinyu Yang

Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-10-12 · Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-03-16 · Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Self-supervised audio-visual source separation leverages natural correlations between audio and vision modalities to separate mixed audio signals. In this work, we first systematically analyse the performance of existing multimodal fusion…

Multimedia · Computer Science · 2025-10-10 · Han Hu, Dongheng Lin, Qiming Huang, Yuqi Hou, Hyung Jin Chang, Jianbo Jiao

The objective of this paper is to recover the original component signals from a mixture audio with the aid of visual cues of the sound sources. Such task is usually referred as visually guided sound source separation. The proposed Cascaded…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-15 · Lingyu Zhu, Esa Rahtu

Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers. Although audio-only approaches achieve satisfactory performance, they build on a strategy to handle the predefined conditions,…

Sound · Computer Science · 2020-12-01 · Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid…

Computer Vision and Pattern Recognition · Computer Science · 2021-09-27 · Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-22 · Ruohan Gao, Kristen Grauman