Related papers: MAST: Multiscale Audio Spectrogram Transformers

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science 2023-03-21 Wentao Zhu, Mohamed Omar

SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Alan Baade, Puyuan Peng, David Harwath

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development.…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Wentao Zhu

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and…

Audio and Speech Processing · Electrical Eng. & Systems 2023-11-08 Xian Li, Nian Shao, Xiaofei Li

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

FlexiAST: Flexibility is What AST Needs

The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs…

Sound · Computer Science 2023-07-19 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Breaking the Barriers of Text-Hungry and Audio-Deficient AI

While global linguistic diversity spans more than 7164 recognized languages, the current dominant architecture of machine intelligence remains fundamentally biased toward written text. This bias excludes over 700 million people particularly…

Sound · Computer Science 2025-06-04 Hamidou Tembine, Issa Bamia, Massa NDong, Bakary Coulibaly, Oumar Issiaka Traore, Moussa Traore, Moussa Sanogo, Mamadou Eric Sangare, Salif Kante, Daryl Noupa Yongueng, Hafiz Tiomoko Ali, Malik Tiomoko, Frejus Laleye, Boualem Djehiche, Wesmanegda Elisee Dipama, Idris Baba Saje, Hammid Mohammed Ibrahim, Moumini Sanogo, Marie Coursel Nininahazwe, Abdul-Latif Siita, Haine Mhlongo, Teddy Nelvy Dieu Merci Kouka, Mariam Serine Jeridi, Mutiyamuogo Parfait Mupenge, Lekoueiry Dehah, Abdoul Aziz Bio Sidi Bouko, Wilfried Franceslas Zokoue, Odette Richette Sambila, Alina RS Mbango, Mady Diagouraga, Oumarou Moussa Sanoussi, Gizachew Dessalegn, Mohamed Lamine Samoura, Bintou Laetitia Audrey Coulibaly

MAST: Masked Augmentation Subspace Training for Generalizable Self-Supervised Priors

Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks…

Machine Learning · Computer Science 2023-03-08 Chen Huang, Hanlin Goh, Jiatao Gu, Josh Susskind

Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-30 Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective…

Sound · Computer Science 2026-05-15 Kohei Yamamoto, Kosuke Okusa

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Kevin Wilkinghoff, Zheng-Hua Tan

Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification

Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square…

Sound · Computer Science 2025-09-01 Aditya Makineni, Baocheng Geng, Qing Tian

ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals

Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-09 Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy

AST: Audio Spectrogram Transformer

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate…

Computation and Language · Computer Science 2026-02-02 Jiaxuan Luo, Siqi Ouyang, Lei Li

Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection

Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-27 Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

Coupling Speech Encoders with Downstream Text Models

We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and…

Computation and Language · Computer Science 2024-07-26 Ciprian Chelba, Johan Schalkwyk