Related papers: A Fully Differentiable Beam Search Decoder

Efficient Sequence Training of Attention Models using Approximative Recombination

Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice.…

Computation and Language · Computer Science 2022-04-22 Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney

Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer the length bias and corresponding beam problem. Different approaches have been applied in simple beam search to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-24 Wei Zhou, Ralf Schlüter, Hermann Ney

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

In this paper, we propose a deep convolutional neural network-based acoustic word embedding system on code-switching query by example spoken term detection. Different from previous configurations, we combine audio data in two languages for…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-26 Murong Ma, Haiwei Wu, Xuyang Wang, Lin Yang, Junjie Wang, Ming Li

Determinantal Beam Search

Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be viewed as a subset optimization problem, albeit one where the corresponding set function does not reflect interactions between candidates.…

Computation and Language · Computer Science 2023-06-26 Clara Meister, Martina Forster, Ryan Cotterell

Deep Learning Based Speech Beamforming

Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would…

Computation and Language · Computer Science 2018-02-16 Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson

Joint Beam Search Integrating CTC, Attention, and Transducer Decoders

End-to-end automatic speech recognition (E2E-ASR) can be classified by its decoder architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-15 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g.,…

Computation and Language · Computer Science 2016-05-13 Kyunghyun Cho

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models

Beam search is a desirable choice of test-time decoding algorithm for neural sequence models because it potentially avoids search errors made by simpler greedy methods. However, typical cross entropy training procedures for these models do…

Machine Learning · Computer Science 2017-10-10 Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick

Differentiable Supervector Extraction for Encoding Speaker and Phrase Information in Text Dependent Speaker Verification

In this paper, we propose a new differentiable neural network alignment mechanism for text-dependent speaker verification which uses alignment models to produce a supervector representation of an utterance. Unlike previous works with…

Sound · Computer Science 2018-12-27 Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

Deep clustering: Discriminative embeddings for segmentation and separation

We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are…

Neural and Evolutionary Computing · Computer Science 2015-08-19 John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

Recent Progresses in Deep Learning based Acoustic Models (Updated)

In this paper, we summarize recent progresses made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length…

Audio and Speech Processing · Electrical Eng. & Systems 2018-04-30 Dong Yu, Jinyu Li

Multi-Head Decoder for End-to-End Speech Recognition

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then,…

Computation and Language · Computer Science 2018-07-31 Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

Beam Search with Bidirectional Strategies for Neural Response Generation

Sequence-to-sequence neural networks have been widely used in language-based applications as they have flexible capabilities to learn various language models. However, when seeking for the optimal language response through trained neural…

Computation and Language · Computer Science 2021-10-08 Pierre Colombo, Chouchang Yang, Giovanna Varni, Chloé Clavel

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the…

Audio and Speech Processing · Electrical Eng. & Systems 2018-08-22 Shruti Palaskar, Florian Metze

Sequence-to-Sequence Models Can Directly Translate Foreign Speech

We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it…

Computation and Language · Computer Science 2017-06-13 Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

Biomimetic Frontend for Differentiable Audio Processing

While models in audio and speech processing are becoming deeper and more end-to-end, they as a consequence need expensive training on large data, and are often brittle. We build on a classical model of human hearing and make it…

Sound · Computer Science 2024-09-16 Ruolan Leslie Famularo, Dmitry N. Zotkin, Shihab A. Shamma, Ramani Duraiswami

High-Fidelity Noise Reduction with Differentiable Signal Processing

Noise reduction techniques based on deep learning have demonstrated impressive performance in enhancing the overall quality of recorded speech. While these approaches are highly performant, their application in audio engineering can be…

Sound · Computer Science 2023-10-18 Christian J. Steinmetz, Thomas Walther, Joshua D. Reiss

A Purely End-to-end System for Multi-speaker Speech Recognition

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have…

Sound · Computer Science 2018-05-16 Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Fully Convolutional Speech Recognition

Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we…

Computation and Language · Computer Science 2019-04-10 Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert