Related papers: Learning Audio Representations with MLPs

200 papers

Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural…

The performance of an Acoustic Scene Classification (ASC) system depends heavily on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layer temporal pooling method using CNN feature sequence as…

Sound · Computer Science 2019-04-04 Liwen Zhang, Jiqing Han

Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and…

Machine Learning · Computer Science 2020-07-09 Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio…

Sound · Computer Science 2025-04-22 Alice Zhang, Edison Thomaz, Lie Lu

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a…

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-09 Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Automatic classification of sound commands is becoming increasingly important, especially for mobile and embedded devices. Many of these devices contain both cameras and microphones, and companies that develop them would like to use the…

With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. We in…

Sound · Computer Science 2025-05-08 Prateek Verma

In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-11 Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-22 Toranosuke Manabe, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu

Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions,…

An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-22 Anup Singh, Kris Demuynck, Vipul Arora

Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Manjunath Mulimani, Annamaria Mesaros

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on…

Sound · Computer Science 2024-05-22 Leonardo Pepino, Pablo Riera, Luciana Ferrer

We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Félix de Chaumont Quitry, Marco Tagliasacchi, Dominik Roblek

Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-03 Yuanbo Hou, Siyang Song, Chuang Yu, Yuxin Song, Wenwu Wang, Dick Botteldooren

Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP's audio features lack…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-16 Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada

Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features that are specialized to a specific audio task and can hardly be generalized to other audio fields. For example,…

Computer Vision and Pattern Recognition · Computer Science 2018-12-13 Imad Rida, Romain Hérault, Gilles Gasso

Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and…

Sound · Computer Science 2024-08-20 Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to…

Computer Vision and Pattern Recognition · Computer Science 2025-05-09 Sooyoung Park, Arda Senocak, Joon Son Chung