Related papers: Multiscale Audio Spectrogram Transformer for Effic…

200 papers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-19 Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

In recent years, researchers have combined both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development.…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Wentao Zhu

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require…

Sound · Computer Science 2022-02-03 Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Alan Baade, Puyuan Peng, David Harwath

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-30 Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines…

Sound · Computer Science 2025-04-21 Anugunj Naman, Gaibo Zhang

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results…

Sound · Computer Science 2023-10-09 Leonardo Pepino, Pablo Riera, Luciana Ferrer

Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily…

Sound · Computer Science 2024-01-17 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Kevin Wilkinghoff, Zheng-Hua Tan

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Ohad Cohen, Gershon Hazan, Sharon Gannot

In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED. Pretrained AST models have…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-08 Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

This paper introduces a new paradigm for sound source localization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based…

Sound · Computer Science 2016-12-20 Clément Gaultier, Saurabh Kataria, Antoine Deleforge

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities -- text, audio and video -- in a multimodal video. Prior work on multimodal abstractive text…

Computation and Language · Computer Science 2020-10-19 Aman Khullar, Udit Arora

Transformer structures have demonstrated outstanding skills in the deep learning space recently, significantly increasing the accuracy of models across a variety of domains. Researchers have started to question whether such a sophisticated…

Sound · Computer Science 2024-01-23 Qingfeng Ji, Jicun Zhang, Yuxin Wang