Related papers: Learning Multiscale Features Directly From Wavefor…

Learning Environmental Sounds with Multi-scale Convolutional Neural Network

Deep learning has dramatically improved the performance of sound recognition. However, learning acoustic models directly from the raw waveform is still challenging. Current waveform-based models generally use time-domain convolutional…

Sound · Computer Science 2018-03-29 Boqing Zhu, Changjian Wang, Feng Liu, Jin Lei, Zengquan Lu, Yuxing Peng

Speech Denoising with Auditory Models

Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. The development of high-performing neural network sound recognition systems has raised the possibility of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-18 Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features. Recent studies attempted to extract speaker embeddings directly from raw waveforms and have shown…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-10 Ge Zhu, Fei Jiang, Zhiyao Duan

Speech and Speaker Recognition from Raw Waveform with SincNet

Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw…

Audio and Speech Processing · Electrical Eng. & Systems 2019-02-26 Mirco Ravanelli, Yoshua Bengio

Speech Recognition Front End Without Information Loss

Speech representation and modelling in high-dimensional spaces of acoustic waveforms, or a linear transformation thereof, is investigated with the aim of improving the robustness of automatic speech recognition to additive noise. The…

Computation and Language · Computer Science 2015-03-31 Matthew Ager, Zoran Cvetkovic, Peter Sollich

Fully Convolutional Speech Recognition

Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we…

Computation and Language · Computer Science 2019-04-10 Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert

Learning Sparse Wavelet Representations

In this work we propose a method for learning wavelet filters directly from data. We accomplish this by framing the discrete wavelet transform as a modified convolutional neural network. We introduce an autoencoder wavelet transform network…

Machine Learning · Computer Science 2018-02-09 Daniel Recoskie, Richard Mann

State Sequences Prediction via Fourier Transform for Representation Learning

While deep reinforcement learning (RL) has been demonstrated effective in solving complex control tasks, sample efficiency remains a key challenge due to the large amounts of data required for remarkable performance. Existing research…

Machine Learning · Computer Science 2023-10-25 Mingxuan Ye, Yufei Kuang, Jie Wang, Rui Yang, Wengang Zhou, Houqiang Li, Feng Wu

Investigation of Time-Frequency Feature Combinations with Histogram Layer Time Delay Neural Networks

While deep learning has reduced the prevalence of manual feature extraction, transformation of data via feature engineering remains essential for improving model performance, particularly for underwater acoustic signals. The methods by…

Sound · Computer Science 2025-03-19 Amirmohammad Mohammadi, Irené Masabarakiza, Ethan Barnes, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

Sound event detection systems typically consist of two stages: extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier. Recently, the focus…

Sound · Computer Science 2018-05-11 Emre Çakır, Tuomas Virtanen

Learning audio representations via phase prediction

We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Félix de Chaumont Quitry, Marco Tagliasacchi, Dominik Roblek

Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Multi-channel acoustic signal processing is a well-established and powerful tool to exploit the spatial diversity between a target signal and non-target or noise sources for signal enhancement. However, the textbook solutions for optimal…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-14 Reinhold Haeb-Umbach, Tomohiro Nakatani, Marc Delcroix, Christoph Boeddeker, Tsubasa Ochiai

Deep Fishing: Gradient Features from Deep Nets

Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels. Deep learning is a marked departure from the previous state of the art, the…

Computer Vision and Pattern Recognition · Computer Science 2015-07-24 Albert Gordo, Adrien Gaidon, Florent Perronnin

Learnable Frequency Filters for Speech Feature Extraction in Speaker Verification

Mel-scale spectrum features are used in various recognition and classification tasks on speech signals. There is no reason to expect that these features are optimal for all different tasks, including speaker verification (SV). This paper…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-16 Jingyu Li, Yusheng Tian, Tan Lee

Filter then Attend: Improving attention-based Time Series Forecasting with Spectral Filtering

Transformer-based models are at the forefront in long time-series forecasting (LTSF). While in many cases, these models are able to achieve state of the art results, they suffer from a bias toward low-frequencies in the data and high…

Machine Learning · Computer Science 2026-05-13 Elisha Dayag, Nhat Thanh Van Tran, Jack Xin

Employing Discrete Fourier Transform in Representational Learning

Image Representation learning via input reconstruction is a common technique in machine learning for generating representations that can be effectively utilized by arbitrary downstream tasks. A well-established approach is using…

Neural and Evolutionary Computing · Computer Science 2025-06-10 Raoof HojatJalali, Edmondo Trentin

Efficient Transformer for Direct Speech Translation

The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, a usual approach…

Computation and Language · Computer Science 2021-07-08 Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà

Deep Learning Based Speech Beamforming

Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would…

Computation and Language · Computer Science 2018-02-16 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson

End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks

Most state-of-the-art phoneme recognition systems rely on classical neural network classifiers, fed with highly tuned features, such as MFCC or PLP features. Recent advances in "deep learning" approaches questioned such systems, but…

Machine Learning · Computer Science 2013-12-10 Dimitri Palaz, Ronan Collobert, Mathew Magimai.-Doss

Frequency learning for image classification

Machine learning applied to computer vision and signal processing is achieving results comparable to the human brain on specific tasks due to the great improvements brought by the deep neural networks (DNN). The majority of state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science 2020-06-30 José Augusto Stuchi, Levy Boccato, Romis Attux