Related papers: FAST: Fast Audio Spectrogram Transformer
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…
Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs.…
Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we…
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending…
Accurate sound localization in a reverberation environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNN shows…
Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs…
In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net…
The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers…
Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201,…
We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector). Different from recent advanced text detectors that used complicated post-processing and hand-crafted network…
The rapid advancement of generative artificial intelligence has spurred innovative approaches to semantic communication, giving rise to a new paradigm known as generative semantic communication (GSC). The integration of flexible cross-modal…
Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They only train a few extra parameters for each downstream task, without sacrificing performance and…
Traffic congestion remains a pressing urban challenge, requiring intelligent transportation systems for real-time management. We present a hybrid framework that combines deep learning and reinforcement learning for acoustic vehicle speed…
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the conformer architecture for efficient training and inference, we carefully redesigned Conformer with a…
Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be…
This paper presents a context-aware framework for feature selection and classification procedures to realize a fast and accurate audio event annotation and classification. The context-aware design starts with exploring feature extraction…
Audio Sentiment Analysis is a popular research area which extends the conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment…
Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as…
Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several…
FullSubNet is our recently proposed real-time single-channel speech enhancement network that achieves outstanding performance on the Deep Noise Suppression (DNS) Challenge dataset. A number of variants of FullSubNet have been proposed, but…