Related papers: MAST: Multiscale Audio Spectrogram Transformers

200 papers

Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science 2023-03-21 Wentao Zhu, Mohamed Omar

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…

Sound · Computer Science 2022-02-14 Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-01 Alan Baade, Puyuan Peng, David Harwath

In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development.…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Wentao Zhu

Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…

Sound · Computer Science 2024-08-08 Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…

Sound · Computer Science 2024-06-13 Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and…

Audio and Speech Processing · Electrical Eng. & Systems 2023-11-08 Xian Li, Nian Shao, Xiaofei Li

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science 2024-07-12 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs…

Sound · Computer Science 2023-07-19 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks…

Machine Learning · Computer Science 2023-03-08 Chen Huang, Hanlin Goh, Jiatao Gu, Josh Susskind

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-30 Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective…

Sound · Computer Science 2026-05-15 Kohei Yamamoto, Kosuke Okusa

Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Kevin Wilkinghoff, Zheng-Hua Tan

Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square…

Sound · Computer Science 2025-09-01 Aditya Makineni, Baocheng Geng, Qing Tian

Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-09 Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong, Yu-An Chung, James Glass

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate…

Computation and Language · Computer Science 2026-02-02 Jiaxuan Luo, Siqi Ouyang, Lei Li

Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-27 Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and…

Computation and Language · Computer Science 2024-07-26 Ciprian Chelba, Johan Schalkwyk