Related papers: SSAST: Self-Supervised Audio Spectrogram Transform…

200 papers

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science · 2021-07-12 · Yuan Gong, Yu-An Chung, James Glass

We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-05-19 · Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships.…

Sound · Computer Science · 2024-08-15 · Sara Atito, Muhammad Awais, Wenwu Wang, Mark D Plumbley, Josef Kittler

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-04-01 · Alan Baade, Puyuan Peng, David Harwath

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-11-08 · Xian Li, Nian Shao, Xiaofei Li

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective…

Sound · Computer Science · 2026-05-15 · Kohei Yamamoto, Kosuke Okusa

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have…

Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…

Sound · Computer Science · 2023-03-21 · Wentao Zhu, Mohamed Omar

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…

Sound · Computer Science · 2024-07-12 · Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-08 · Xian Li, Xiaofei Li

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a…

Sound · Computer Science · 2024-08-28 · Yiqiang Cai, Shengchen Li, Xi Shao

Vision transformers (ViT) have made substantial progress for classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, relatively unexplored is the use of a…

Sound · Computer Science · 2024-07-08 · Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj

Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based models for audio tasks are…

Sound · Computer Science · 2022-04-28 · Dading Chong, Helin Wang, Peilin Zhou, Qingcheng Zeng

Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-09 · Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen

Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR). This is evident in the performance of the wav2vec 2.0 based…

Computation and Language · Computer Science · 2022-07-05 · Mitchell DeHaven, Jayadev Billa

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines…

Sound · Computer Science · 2025-04-21 · Anugunj Naman, Gaibo Zhang

Self-supervised learning (SSL) is a powerful tool that allows learning of underlying representations from unlabeled data. Transformer-based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally these…

Computation and Language · Computer Science · 2022-02-08 · Bethan Thomas, Samuel Kessler, Salah Karout

While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have observed a lot of renewed interest, including the extended…

Sound · Computer Science · 2025-08-20 · Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-01-30 · Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to…

Computation and Language · Computer Science · 2026-03-11 · Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le