Related papers: Recursive Visual Sound Separation Using Minus-Plus…

CatNet: music source separation system with mix-audio augmentation

Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. Recently, neural network based methods have been applied to address the MSS problem, and can be categorized…

Sound · Computer Science 2021-02-22 Xuchen Song, Qiuqiang Kong, Xingjian Du, Yuxuan Wang

Multi-scale Multi-band DenseNets for Audio Source Separation

This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of the problems of audio source separation, the current state-of-the-art approaches employ deep neural networks to obtain instrumental…

Sound · Computer Science 2017-06-30 Naoya Takahashi, Yuki Mitsufuji

From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation

Audio-visual speech separation aims to isolate each speaker's clean voice from mixtures by leveraging visual cues such as lip movements and facial features. While visual information provides complementary semantic guidance, existing methods…

Sound · Computer Science 2025-10-13 Ke Xue, Rongfei Fan, Lixin, Dawei Zhao, Chao Zhu, Han Hu

Multi-Task Audio Source Separation

The audio source separation tasks, such as speech enhancement, speech separation, and music source separation, have achieved impressive performance in recent studies. The powerful modeling capabilities of deep neural networks give us hope…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-15 Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation.…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-18 Vahid Ahmadi Kalkhorani, Cheng Yu, Anurag Kumar, Ke Tan, Buye Xu, DeLiang Wang

TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation

In recent years, many deep learning techniques for single-channel sound source separation have been proposed using recurrent, convolutional and transformer networks. When multiple microphones are available, spatial diversity between…

Audio and Speech Processing · Electrical Eng. & Systems 2022-08-23 Ali Aroudi, Stefan Uhlich, Marc Ferras Font

Source separation with weakly labelled data: An approach to computational auditory scene analysis

Source separation is the task to separate an audio recording into individual sound sources. Source separation is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular…

Sound · Computer Science 2020-02-07 Qiuqiang Kong, Yuxuan Wang, Xuchen Song, Yin Cao, Wenwu Wang, Mark D. Plumbley

Co-Separating Sounds of Visual Objects

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video…

Computer Vision and Pattern Recognition · Computer Science 2019-08-22 Ruohan Gao, Kristen Grauman

MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation

Deep neural networks have become an indispensable technique for audio source separation (ASS). It was recently reported that a variant of CNN architecture called MMDenseNet was successfully employed to solve the ASS problem of estimating…

Sound · Computer Science 2018-05-30 Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji

Music Source Separation Using Stacked Hourglass Networks

In this paper, we propose a simple yet effective method for multiple music source separation using convolutional neural networks. Stacked hourglass network, which was originally designed for human pose estimation in natural images, is…

Sound · Computer Science 2018-06-25 Sungheon Park, Taehoon Kim, Kyogu Lee, Nojun Kwak

Hybrid Y-Net Architecture for Singing Voice Separation

This research paper presents a novel deep learning-based neural network architecture, named Y-Net, for achieving music source separation. The proposed architecture performs end-to-end hybrid source separation by extracting features from…

Sound · Computer Science 2023-03-07 Rashen Fernando, Pamudu Ranasinghe, Udula Ranasinghe, Janaka Wijayakulasooriya, Pantaleon Perera

Deep Remix: Remixing Musical Mixtures Using a Convolutional Deep Neural Network

Audio source separation is a difficult machine learning problem and performance is measured by comparing extracted signals with the component source signals. However, if separation is motivated by the ultimate goal of re-mixing then…

Sound · Computer Science 2015-05-05 Andrew J. R. Simpson, Gerard Roma, Mark D. Plumbley

Leveraging Category Information for Single-Frame Visual Sound Source Separation

Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but with the expense of large multi-stage architectures and…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Lingyu Zhu, Esa Rahtu

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-03 Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya

Weakly-supervised Audio-visual Sound Source Detection and Separation

Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Tanzila Rahman, Leonid Sigal

Semantic Grouping Network for Audio Source Separation

Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance…

Sound · Computer Science 2024-07-08 Shentong Mo, Yapeng Tian

Separate What You Describe: Language-Queried Audio Source Separation

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-30 Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Progressive Confident Masking Attention Network for Audio-Visual Segmentation

Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Yuxuan Wang, Jinchao Zhu, Feng Dong, Shuyue Zhu

An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy

Audio classification can distinguish different kinds of sounds, which is helpful for intelligent applications in daily life. However, it remains a challenging task since the sound events in an audio clip are probably multiple, even…

Audio and Speech Processing · Electrical Eng. & Systems 2019-11-22 Jiaxu Chen, Jing Hao, Kai Chen, Di Xie, Shicai Yang, Shiliang Pu

The Sound of Pixels

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.…

Computer Vision and Pattern Recognition · Computer Science 2018-10-16 Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba