Related papers: Multiscale Audio Spectrogram Transformer for Effic…

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-19 Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development.…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Wentao Zhu

SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require…

Sound · Computer Science 2022-02-03 Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Alan Baade, Puyuan Peng, David Harwath

Breaking the Barriers of Text-Hungry and Audio-Deficient AI

While global linguistic diversity spans more than 7164 recognized languages, the current dominant architecture of machine intelligence remains fundamentally biased toward written text. This bias excludes over 700 million people particularly…

Sound · Computer Science 2025-06-04 Hamidou Tembine, Issa Bamia, Massa NDong, Bakary Coulibaly, Oumar Issiaka Traore, Moussa Traore, Moussa Sanogo, Mamadou Eric Sangare, Salif Kante, Daryl Noupa Yongueng, Hafiz Tiomoko Ali, Malik Tiomoko, Frejus Laleye, Boualem Djehiche, Wesmanegda Elisee Dipama, Idris Baba Saje, Hammid Mohammed Ibrahim, Moumini Sanogo, Marie Coursel Nininahazwe, Abdul-Latif Siita, Haine Mhlongo, Teddy Nelvy Dieu Merci Kouka, Mariam Serine Jeridi, Mutiyamuogo Parfait Mupenge, Lekoueiry Dehah, Abdoul Aziz Bio Sidi Bouko, Wilfried Franceslas Zokoue, Odette Richette Sambila, Alina RS Mbango, Mady Diagouraga, Oumarou Moussa Sanoussi, Gizachew Dessalegn, Mohamed Lamine Samoura, Bintou Laetitia Audrey Coulibaly

Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-30 Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

AST: Audio Spectrogram Transformer

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

FAST: Fast Audio Spectrogram Transformer

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines…

Sound · Computer Science 2025-04-21 Anugunj Naman, Gaibo Zhang

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

Study of positional encoding approaches for Audio Spectrogram Transformers

Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state of the art results…

Sound · Computer Science 2023-10-09 Leonardo Pepino, Pablo Riera, Luciana Ferrer

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily…

Sound · Computer Science 2024-01-17 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Kevin Wilkinghoff, Zheng-Hua Tan

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Ohad Cohen, Gershon Hazan, Sharon Gannot

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED. Pretrained AST models have…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-08 Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

VAST: The Virtual Acoustic Space Traveler Dataset

This paper introduces a new paradigm for sound source localization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based…

Sound · Computer Science 2016-12-20 Clément Gaultier, Saurabh Kataria, Antoine Deleforge

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities -- text, audio and video -- in a multimodal video. Prior work on multimodal abstractive text…

Computation and Language · Computer Science 2020-10-19 Aman Khullar, Udit Arora

ASM: Audio Spectrogram Mixer

Transformer structures have demonstrated outstanding skills in the deep learning space recently, significantly increasing the accuracy of models across a variety of domains. Researchers have started to question whether such a sophisticated…

Sound · Computer Science 2024-01-23 Qingfeng Ji, Jicun Zhang, Yuxin Wang