Related papers: FAST: Fast Audio Spectrogram Transformer

AST: Audio Spectrogram Transformer

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science 2023-03-21 Wentao Zhu, Mohamed Omar

Transformer Based Machine Fault Detection From Audio Input

In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net…

Sound · Computer Science 2026-04-15 Kiran Voderhobli Holla

Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers…

Sound · Computer Science 2023-10-25 Florian Schmid, Khaled Koutini, Gerhard Widmer

Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201,…

Sound · Computer Science 2026-01-21 Theodore Aptekarev, Vladimir Sokolovsky, Gregory Furman

FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation

We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector). Different from recent advanced text detectors that used complicated post-processing and hand-crafted network…

Computer Vision and Pattern Recognition · Computer Science 2023-01-12 Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, Tong Lu

FAST: Flexible and Adaptive Semantic Transmission for Resource-constrained Multi-user Generative Semantic Communication

The rapid advancement of generative artificial intelligence has spurred innovative approaches to semantic communication, giving rise to a new paradigm known as generative semantic communication (GSC). The integration of flexible cross-modal…

Signal Processing · Electrical Eng. & Systems 2025-11-03 Yiru Wang, Wanting Yang, Fangli Mou, Zehui Xiong, Zide Fan, Shiwen Mao, Tony Q. S. Quek

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They only train a few extra parameters for each downstream task, without sacrificing performance and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-16 Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti, Mirco Ravanelli

Adaptive Vehicle Speed Classification via BMCNN with Reinforcement Learning-Enhanced Acoustic Processing

Traffic congestion remains a pressing urban challenge, requiring intelligent transportation systems for real-time management. We present a hybrid framework that combines deep learning and reinforcement learning for acoustic vehicle speed…

Sound · Computer Science 2025-09-03 Yuli Zhang, Pengfei Fan, Ruiyuan Jiang, Hankang Gu, Dongyao Jia, Xinheng Wang

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the conformer architecture for efficient training and inference, we carefully redesigned Conformer with a…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-03 Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

Audio Transformers

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be…

Sound · Computer Science 2025-05-13 Prateek Verma, Jonathan Berger

Face: Fast, Accurate and Context-Aware Audio Annotation and Classification

This paper presents a context-aware framework for feature selection and classification procedures to realize a fast and accurate audio event annotation and classification. The context-aware design starts with exploring feature extraction…

Sound · Computer Science 2023-03-08 M. Mehrdad Morsali, Hoda Mohammadzade, Saeed Bagheri Shouraki

Learning Robust Heterogeneous Signal Features from Parallel Neural Network for Audio Sentiment Analysis

Audio Sentiment Analysis is a popular research area which extends the conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-01 Feiyang Chen, Ziqian Luo

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as…

Sound · Computer Science 2023-06-26 Florian Schmid, Khaled Koutini, Gerhard Widmer

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-13 Shehzeen Hussain, Mojan Javaheripi, Paarth Neekhara, Ryan Kastner, Farinaz Koushanfar

Fast FullSubNet: Accelerate Full-band and Sub-band Fusion Model for Single-channel Speech Enhancement

FullSubNet is our recently proposed real-time single-channel speech enhancement network that achieves outstanding performance on the Deep Noise Suppression (DNS) Challenge dataset. A number of variants of FullSubNet have been proposed, but…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-08 Xiang Hao, Xiaofei Li