Related papers: AST: Audio Spectrogram Transformer

SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong , Cheng-I Jeff Lai , Yu-An Chung , James Glass

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng , Mehmet Hamza Erol , Joon Son Chung , Arda Senocak

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

Audio classification is an active research area with a wide range of applications. Over the past decade, convolutional neural networks (CNNs) have been the de-facto standard building block for end-to-end audio classification models.…

Sound · Computer Science 2022-03-15 Yuan Gong , Sameer Khurana , Andrew Rouditchenko , James Glass

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang , Jie Shi , Kiki van der Heijden , Siamak Mehrkanoon

FAST: Fast Audio Spectrogram Transformer

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines…

Sound · Computer Science 2025-04-21 Anugunj Naman , Gaibo Zhang

Attention Is All You Need For Blind Room Volume Estimation

In recent years, dynamic parameterization of acoustic environments has raised increasing attention in the field of audio processing. One of the key parameters that characterize the local room acoustics in isolation from orientation and…

Audio and Speech Processing · Electrical Eng. & Systems 2023-12-29 Chunxi Wang , Maoshen Jia , Meiran Li , Changchun Bao , Wenyu Jin

Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201,…

Sound · Computer Science 2026-01-21 Theodore Aptekarev , Vladimir Sokolovsky , Gregory Furman

Audio Transformers

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be…

Sound · Computer Science 2025-05-13 Prateek Verma , Jonathan Berger

Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

This paper presents a way of doing large scale audio understanding without traditional state of the art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional…

Sound · Computer Science 2022-02-01 Prateek Verma

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They only train a few extra parameters for each downstream task, without sacrificing performance and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-16 Umberto Cappellazzo , Daniele Falavigna , Alessio Brutti , Mirco Ravanelli

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science 2023-03-21 Wentao Zhu , Mohamed Omar

Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers…

Sound · Computer Science 2023-10-25 Florian Schmid , Khaled Koutini , Gerhard Widmer

Adapting a ConvNeXt model to audio classification on AudioSet

In computer vision, convolutional neural networks (CNN) such as ConvNeXt, have been able to surpass state-of-the-art transformers, partly thanks to depthwise separable convolutions (DSC). DSC, as an approximation of the regular convolution,…

Sound · Computer Science 2023-06-02 Thomas Pellegrini , Ismail Khalfaoui-Hassani , Etienne Labbé , Timothée Masquelier

Audio Captioning Transformer

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-22 Xinhao Mei , Xubo Liu , Qiushi Huang , Mark D. Plumbley , Wenwu Wang

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-19 Sreyan Ghosh , Ashish Seth , S. Umesh , Dinesh Manocha

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Audio classification is considered as a challenging problem in pattern recognition. Recently, many algorithms have been proposed using deep neural networks. In this paper, we introduce a new attention-based neural network architecture…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-18 Haoye Lu , Haolong Zhang , Amit Nayak

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Anmol Gulati , James Qin , Chung-Cheng Chiu , Niki Parmar , Yu Zhang , Jiahui Yu , Wei Han , Shibo Wang , Zhengdong Zhang , Yonghui Wu , Ruoming Pang

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as…

Sound · Computer Science 2023-06-26 Florian Schmid , Khaled Koutini , Gerhard Widmer

Atss-Net: Target Speaker Separation via Attention-based Neural Network

Recently, Convolutional Neural Network (CNN) and Long short-term memory (LSTM) based models have been introduced to deep learning-based target speaker separation. In this paper, we propose an Attention-based neural network (Atss-Net) in the…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-20 Tingle Li , Qingjian Lin , Yuanyuan Bao , Ming Li

Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification

Deep convolutional neural networks (CNNs) have been applied to extracting speaker embeddings with significant success in speaker verification. Incorporating the attention mechanism has shown to be effective in improving the model…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-01 Jingyu Li , Yusheng Tian , Tan Lee