Related papers: Speech-Declipping Transformer with Complex Spectro…
Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based hybrid speech recognition systems and recent end-to-end systems. Nevertheless, the receptive fields of TDNNs are limited and fixed, which is not desirable for tasks…
Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing…
Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One…
In speech enhancement, complex neural network has shown promising performance due to their effectiveness in processing complex-valued spectrum. Most of the recent speech enhancement approaches mainly focus on wide-band signal with a…
Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone…
We propose an end-to-end speech enhancement method with trainable time-frequency~(T-F) transform based on invertible deep neural network~(DNN). The resent development of speech enhancement is brought by using DNN. The ordinary DNN-based…
Speech denoising (SD) is an important task of many, if not all, modern signal processing chains used in devices and for everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD,…
We present a novel model designed for resource-efficient multichannel speech enhancement in the time domain, with a focus on low latency, lightweight, and low computational requirements. The proposed model incorporates explicit spatial and…
Recurrent neural networks (RNNs) have shown significant improvements in recent years for speech enhancement. However, the model complexity and inference time cost of RNNs are much higher than deep feed-forward neural networks (DNNs).…
In the process of recording, storage and transmission of time-domain audio signals, errors may be introduced that are difficult to correct in an unsupervised way. Here, we train a convolutional deep neural network to re-synthesize input…
Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered…
Speech separation remains an important topic for multi-speaker technology researchers. Convolution augmented transformers (conformers) have performed well for many speech processing tasks but have been under-researched for speech…
This study proposes a fully convolutional network (FCN) model for raw waveform-based speech enhancement. The proposed system performs speech enhancement in an end-to-end (i.e., waveform-in and waveform-out) manner, which dif-fers from most…
We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band…
This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often…
This paper introduces a deep neural network model for subband-based speech synthesizer. The model benefits from the short bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employed the…
Current deep neural network (DNN) based speech separation faces a fundamental challenge -- while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing…
In this paper, we introduce a spectral-domain inverse filtering approach for single-channel speech de-reverberation using deep convolutional neural network (CNN). The main goal is to better handle realistic reverberant conditions where the…
In this paper, we propose a transformer-based architecture, called two-stage transformer neural network (TSTNN) for end-to-end speech denoising in the time domain. The proposed model is composed of an encoder, a two-stage transformer module…
Conventional time-delay neural networks (TDNNs) struggle to handle long-range context, their ability to represent speaker information is therefore limited in long utterances. Existing solutions either depend on increasing model complexity…