Related papers: Audio Transformers

Transformer Based Machine Fault Detection From Audio Input

In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net…

Sound · Computer Science 2026-04-15 Kiran Voderhobli Holla

A Language Model With Million Context Length For Raw Audio

Modeling long-term dependencies for audio signals is a particularly challenging problem, as even small-time scales yield on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures became good at…

Sound · Computer Science 2024-12-24 Prateek Verma

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Music, speech, and acoustic scene sound are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain grows rapidly by versatile image classification models, it is necessary…

Sound · Computer Science 2017-12-05 Jongpil Lee , Taejun Kim , Jiyoung Park , Juhan Nam

Understanding Audio Pattern Using Convolutional Neural Network From Raw Waveforms

One key step in audio signal processing is to transform the raw signal into representations that are efficient for encoding the original information. Traditionally, people transform the audio into spectral representations, as a function of…

Sound · Computer Science 2016-11-30 Shuhui Qu , Juncheng Li , Wei Dai , Samarjit Das

Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

This paper presents a way of doing large scale audio understanding without traditional state of the art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional…

Sound · Computer Science 2022-02-01 Prateek Verma

Speaker Recognition from Raw Waveform with SincNet

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly.…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-12 Mirco Ravanelli , Yoshua Bengio

AST: Audio Spectrogram Transformer

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To…

Sound · Computer Science 2021-07-12 Yuan Gong , Yu-An Chung , James Glass

Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Motivated by the fact that characteristics of different sound classes are highly diverse in different temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for the environmental…

Sound · Computer Science 2018-06-15 Boqing Zhu , Kele Xu , Dezhi Wang , Lilun Zhang , Bo Li , Yuxing Peng

Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers…

Sound · Computer Science 2023-10-25 Florian Schmid , Khaled Koutini , Gerhard Widmer

Content Adaptive Front End For Audio Classification

We propose a learnable content adaptive front end for audio signal processing. Before the modern advent of deep learning, we used fixed representation non-learnable front-ends like spectrogram or mel-spectrogram with/without neural…

Sound · Computer Science 2024-12-24 Prateek Verma , Chris Chafe

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Anmol Gulati , James Qin , Chung-Cheng Chiu , Niki Parmar , Yu Zhang , Jiahui Yu , Wei Han , Shibo Wang , Zhengdong Zhang , Yonghui Wu , Ruoming Pang

Very Deep Convolutional Neural Networks for Raw Waveforms

Learning acoustic models directly from the raw waveform data with minimal processing is challenging. Current waveform-based models have generally used very few (~2) convolutional layers, which might be insufficient for building high-level…

Sound · Computer Science 2016-10-04 Wei Dai , Chia Dai , Shuhui Qu , Juncheng Li , Samarjit Das

CNN Architectures for Large-Scale Audio Classification

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with…

Sound · Computer Science 2017-01-11 Shawn Hershey , Sourish Chaudhuri , Daniel P. W. Ellis , Jort F. Gemmeke , Aren Jansen , R. Channing Moore , Manoj Plakal , Devin Platt , Rif A. Saurous , Bryan Seybold , Malcolm Slaney , Ron J. Weiss , Kevin Wilson

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-14 Wenyong Huang , Wenchao Hu , Yu Ting Yeung , Xiao Chen

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Recent advances in transformer-based architectures which are pre-trained in self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-11 Johannes Wagner , Andreas Triantafyllopoulos , Hagen Wierstorf , Maximilian Schmitt , Felix Burkhardt , Florian Eyben , Björn W. Schuller

Speech Recognition Front End Without Information Loss

Speech representation and modelling in high-dimensional spaces of acoustic waveforms, or a linear transformation thereof, is investigated with the aim of improving the robustness of automatic speech recognition to additive noise. The…

Computation and Language · Computer Science 2015-03-31 Matthew Ager , Zoran Cvetkovic , Peter Sollich

Audiomer: A Convolutional Transformer For Keyword Spotting

Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to extremely large sequence length of audio waveforms or incur a…

Machine Learning · Computer Science 2022-02-02 Surya Kant Sahu , Sai Mitheran , Juhi Kamdar , Meet Gandhi

A Generative Model for Raw Audio Using Transformer Architectures

This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to wavenet. This is fully probabilistic, auto-regressive, and…

Sound · Computer Science 2021-07-09 Prateek Verma , Chris Chafe

Efficient Training of Audio Transformers with Patchout

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform…

Sound · Computer Science 2023-01-26 Khaled Koutini , Jan Schlüter , Hamid Eghbal-zadeh , Gerhard Widmer

Speech and Speaker Recognition from Raw Waveform with SincNet

Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw…

Audio and Speech Processing · Electrical Eng. & Systems 2019-02-26 Mirco Ravanelli , Yoshua Bengio