Related papers: Learning Audio Representations with MLPs
Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural…
The performance of an Acoustic Scene Classification (ASC) system is highly depending on the latent temporal dynamics of the audio signal. In this paper, we proposed a multiple layers temporal pooling method using CNN feature sequence as…
Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and…
Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio…
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a…
Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example,…
Automatic classification of sound commands is becoming increasingly important, especially for mobile and embedded devices. Many of these devices contain both cameras and microphones, and companies that develop them would like to use the…
With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. We in…
In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose…
Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is…
Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions,…
An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This…
Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same…
The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on…
We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of…
Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events…
Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP's audio features lack…
Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features specialized to a specific audio task and that can hardly generalized to other audio fields. For example,…
Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and…
Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to…