Related papers: Learning Audio Representations with MLPs

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural…

Sound · Computer Science 2022-10-26 Gasser Elbanna, Neil Scheidwasser-Clow, Mikolaj Kegler, Pierre Beckmann, Karl El Hajal, Milos Cernak

Acoustic scene classification using multi-layer temporal pooling based on convolutional neural network

The performance of an Acoustic Scene Classification (ASC) system depends heavily on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layer temporal pooling method using CNN feature sequence as…

Sound · Computer Science 2019-04-04 Liwen Zhang, Jiqing Han

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. To achieve high performance, DNNs often need a large amount of annotated data, which can be difficult and…

Machine Learning · Computer Science 2020-07-09 Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Transformation of audio embeddings into interpretable, concept-based representations

Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio…

Sound · Computer Science 2025-04-22 Alice Zhang, Edison Thomaz, Lie Lu

HEAR: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a…

Sound · Computer Science 2025-06-18 Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-09 Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Deep Learning Approaches for Understanding Simple Speech Commands

Automatic classification of sound commands is becoming increasingly important, especially for mobile and embedded devices. Many of these devices contain both cameras and microphones, and companies that develop them would like to use the…

Sound · Computer Science 2018-10-08 Roman A. Solovyev, Maxim Vakhrushev, Alexander Radionov, Vladimir Aliev, Alexey A. Shvets

Diverse Audio Embeddings -- Bringing Features Back Outperforms CLAP!

With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. We in…

Sound · Computer Science 2025-05-08 Prateek Verma

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-11 Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

ProLAP: Probabilistic Language-Audio Pre-Training

Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-22 Toranosuke Manabe, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu

SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions,…

Sound · Computer Science 2024-02-01 Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote

Attention-Based Audio Embeddings for Query-by-Example

An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-22 Anup Singh, Kris Demuynck, Vipul Arora

Online incremental learning for audio classification using a pretrained audio model

Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Manjunath Mulimani, Annamaria Mesaros

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on…

Sound · Computer Science 2024-05-22 Leonardo Pepino, Pablo Riera, Luciana Ferrer

Learning audio representations via phase prediction

We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Félix de Chaumont Quitry, Marco Tagliasacchi, Dominik Roblek

Multi-dimensional Edge-based Audio Event Relational Graph Representation Learning for Acoustic Scene Classification

Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-03 Yuanbo Hou, Siyang Song, Chuang Yu, Yuxin Song, Wenwu Wang, Dick Botteldooren

M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP

Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP's audio features lack…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-16 Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

An efficient supervised dictionary learning method for audio signal recognition

Machine hearing, or machine listening, is an emerging area. Conventional approaches rely on handcrafted features that are specialized to a specific audio task and can hardly be generalized to other audio fields. For example,…

Computer Vision and Pattern Recognition · Computer Science 2018-12-13 Imad Rida, Romain Hérault, Gilles Gasso

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and…

Sound · Computer Science 2024-08-20 Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-09 Sooyoung Park, Arda Senocak, Joon Son Chung