Related papers: Spectrogram features for audio and speech analysis
One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, a neural network. For audio, the choice is less obvious than…
Acoustic recognition has emerged as a prominent task in deep learning research, frequently utilizing spectral feature extraction techniques such as the spectrogram from the Short-Time Fourier Transform and the scalogram from the Wavelet…
Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the…
We propose the Neuralogram -- a deep neural network based representation for understanding audio signals which, as the name suggests, transforms an audio signal to a dense, compact representation based upon embeddings learned via a neural…
Pattern recognition from audio signals is an active research topic encompassing audio tagging, acoustic scene classification, music classification, and other areas. Spectrogram and mel-frequency cepstral coefficients (MFCC) are among the…
Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand crafted speech processing features extracted with off-the-shelf…
Speech separation has been very successful with deep learning techniques. Substantial effort has been reported based on approaches over spectrogram, which is well known as the standard time-and-frequency cross-domain representation for…
While traditional audio visualization methods depict amplitude intensities vs. time, such as in a time-frequency spectrogram, and while some may use complex phase information to augment the amplitude representation, such as in a reassigned…
Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations. These features produce a higher dimensional representation of…
Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered…
Speech representation and modelling in high-dimensional spaces of acoustic waveforms, or a linear transformation thereof, is investigated with the aim of improving the robustness of automatic speech recognition to additive noise. The…
A new language model for speech recognition inspired by linguistic analysis is presented. The model develops hidden hierarchical structure incrementally and uses it to extract meaningful information from the word history - thus enabling the…
Modern day audio signal classification techniques lack the ability to classify low feature audio signals in the form of spectrographic temporal frequency data representations. Additionally, currently utilized techniques rely on full diverse…
Distinct striation patterns are observed in the spectrograms of speech and music. This motivated us to propose three novel time-frequency features for speech-music classification. These features are extracted in two stages. First, a preset…
Style transfer is a technique for combining two images based on the activations and feature statistics in a deep learning neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at…
Deep representation learning is a crucial procedure in multimedia analysis and attracts increasing attention. Most of the popular techniques rely on convolutional neural network and require a large amount of labeled data in the training…
Sound is a fundamental and rich source of information; playing a key role in many areas from humanities and social sciences through to engineering and mathematics. Sound is more than just data 'signals'. It encapsulates physical, sensorial…
Audio source separation is often achieved by estimating the magnitude spectrogram of each source, and then applying a phase recovery (or spectrogram inversion) algorithm to retrieve time-domain signals. Typically, spectrogram inversion is…
Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to…
This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks for vowel…