
Related papers: Learning neural audio features without supervision

200 papers

Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental…

Sound · Computer Science 2021-01-22 Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

The purpose of this paper is to compare different learnable frontends in medical acoustics tasks. A framework has been implemented to classify human respiratory sounds and heartbeats in two categories, i.e. healthy or affected by…

Sound · Computer Science 2026-01-21 Alessandro Maria Poirè, Federico Simonetta, Stavros Ntalampiras

While much of modern speech and audio processing relies on deep neural networks trained using fixed audio representations, recent studies suggest great potential in acoustic frontends learnt jointly with a backend. In this study, we focus…

Audio and Speech Processing · Electrical Eng. & Systems 2023-02-21 Mark Anderson, Tomi Kinnunen, Naomi Harte

Hand-crafted features, such as Mel-filterbanks, have traditionally been the choice for many audio processing applications. Recently, there has been a growing interest in learnable front-ends that extract representations directly from the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-06 Qiquan Zhang, Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Haizhou Li

We propose a learnable content adaptive front end for audio signal processing. Before the modern advent of deep learning, we used fixed representation non-learnable front-ends like spectrogram or mel-spectrogram with/without neural…

Sound · Computer Science 2024-12-24 Prateek Verma, Chris Chafe

Mel-scale spectrum features are used in various recognition and classification tasks on speech signals. There is no reason to expect that these features are optimal for all different tasks, including speaker verification (SV). This paper…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-16 Jingyu Li, Yusheng Tian, Tan Lee

Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to…

Sound · Computer Science 2014-07-14 Dan Stowell, Mark D. Plumbley

Recent progress in network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet. While this process allows knowledge transfer across different domains, training a model on large-scale…

Sound · Computer Science 2021-05-21 Sascha Hornauer, Ke Li, Stella X. Yu, Shabnam Ghaffarzadegan, Liu Ren

In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal…

When convolutional neural networks are used to tackle learning problems based on music or, more generally, time series data, raw one-dimensional data are commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients, which…

Machine Learning · Computer Science 2018-09-20 Monika Doerfler, Thomas Grill, Roswitha Bammer, Arthur Flexer

Deep neural networks have recently achieved breakthroughs in sound generation. Despite the outstanding sample quality, current sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting…

Sound · Computer Science 2024-07-30 Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang

In audio classification, differentiable auditory filterbanks with few parameters cover the middle ground between hard-coded spectrograms and raw audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with Per-Channel Energy…

Sound · Computer Science 2022-07-13 Jan Schlüter, Gerald Gutenbrunner

Deep learning has been applied to diverse audio semantics tasks, enabling the construction of models that learn hierarchical levels of features from high-dimensional raw data, delivering state-of-the-art performance. But do these algorithms…

Sound · Computer Science 2021-07-21 Lazaros Vrysis, Iordanis Thoidis, Charalampos Dimoulas, George Papanikolaou

Self-supervised pre-training using so-called "pretext" tasks has recently shown impressive performance across a wide range of modalities. In this work, we advance self-supervised learning from permutations, by pre-training a model to…

Sound · Computer Science 2021-05-05 Andrew N Carr, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Neil Zeghidour

In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio,…

Sound · Computer Science 2020-05-05 Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram

Deep neural networks (DNNs) have been shown to over-fit a dataset when being trained with noisy labels for a long enough time. To overcome this problem, we present a simple and effective method self-ensemble label filtering (SELF) to…

Computer Vision and Pattern Recognition · Computer Science 2019-10-07 Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox

In audio signal processing, learnable front-ends have shown strong performance across diverse tasks by optimizing task-specific representation. However, their parameters remain fixed once trained, lacking flexibility during inference and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-29 Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah, Qiquan Zhang, Haizhou Li

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal…

Sound · Computer Science 2020-11-05 Soo-Whan Chung, Hong Goo Kang, Joon Son Chung

Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparable large annotated…

Computer Vision and Pattern Recognition · Computer Science 2024-08-22 Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have…
