Related papers: ASM: Audio Spectrogram Mixer

Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201,…

Sound · Computer Science 2026-01-21 Theodore Aptekarev, Vladimir Sokolovsky, Gregory Furman

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science 2023-03-21 Wentao Zhu, Mohamed Omar

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-19 Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

Mixer is more than just a model

Recently, MLP structures have regained popularity, with MLP-Mixer standing out as a prominent example. In the field of computer vision, MLP-Mixer is noted for its ability to extract data information from both channel and token perspectives,…

Machine Learning · Computer Science 2024-03-05 Qingfeng Ji, Yuxin Wang, Letong Sun

AST: Audio Spectrogram Transformer

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

Hybrid Audio Detection Using Fine-Tuned Audio Spectrogram Transformers: A Dataset-Driven Evaluation of Mixed AI-Human Speech

The rapid advancement of artificial intelligence (AI) has enabled sophisticated audio generation and voice cloning technologies, posing significant security risks for applications reliant on voice authentication. While existing datasets and…

Sound · Computer Science 2025-05-22 Kunyang Huang, Bin Hu

Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Respiratory sound classification is hindered by the limited size, high noise levels, and severe class imbalance of benchmark datasets like ICBHI 2017. While Transformer-based models offer powerful feature extraction capabilities, they are…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-30 Atakan Işık, Selin Vulga Işık, Ahmet Feridun Işık, Mahşuk Taylan

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

FAST: Fast Audio Spectrogram Transformer

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines…

Sound · Computer Science 2025-04-21 Anugunj Naman, Gaibo Zhang

A Study of Incorporating Articulatory Movement Information in Speech Enhancement

Although deep learning algorithms are widely used for improving speech enhancement (SE) performance, the performance remains limited under highly challenging conditions, such as unseen noise or noise signals having low signal-to-noise…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-10 Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Xugang Lu, Yu Tsao

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements…

Computation and Language · Computer Science 2026-05-20 Benjamin L. Badger

Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR

In multi-channel speech enhancement and robust automatic speech recognition (ASR), beamforming can typically improve the signal-to-noise ratio (SNR) of the target speaker and produce reliable enhancement with little distortion to target…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-22 Zhong-Qiu Wang, Ruizhe Pang

Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification

Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been a growing interest in contact-free medical care based on electronic stethoscopes. To this end,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-30 Sangmin Bae, June-Woo Kim, Won-Yang Cho, Hyerim Baek, Soyoun Son, Byungjo Lee, Changwan Ha, Kyongpil Tae, Sungnyun Kim, Se-Young Yun

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily…

Sound · Computer Science 2024-01-17 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-22 Weiqiao Shan, Yuang Li, Yuhao Zhang, Yingfeng Luo, Chen Xu, Xiaofeng Zhao, Long Meng, Yunfei Lu, Min Zhang, Hao Yang, Tong Xiao, Jingbo Zhu

Modality-Order Matters! A Novel Hierarchical Feature Fusion Method for CoSAm: A Code-Switched Autism Corpus

Autism Spectrum Disorder (ASD) is a complex neuro-developmental challenge, presenting a spectrum of difficulties in social interaction, communication, and the expression of repetitive behaviors in different situations. This increasing…

Machine Learning · Computer Science 2025-06-16 Mohd Mujtaba Akhtar, Girish, Muskaan Singh, Orchid Chetia Phukan

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for…

Sound · Computer Science 2024-03-01 Quentin Raymondaud, Mickael Rouvier, Richard Dufour