Related papers: High Performance Sequence-to-Sequence Model for St…

On using 2D sequence-to-sequence models for speech recognition

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. Using these architectures, one-dimensional input and output sequences are related by an attention approach, thereby replacing more…

Computation and Language · Computer Science · 2019-11-21 · Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need of…

Computation and Language · Computer Science · 2020-11-17 · Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural…

Computation and Language · Computer Science · 2018-02-26 · Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

Typical high quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually…

Sound · Computer Science · 2021-04-05 · Qing He, Zhiping Xiu, Thilo Koehler, Jilong Wu

Encoder-decoder with Focus-mechanism for Sequence Labelling Based Spoken Language Understanding

This paper investigates the framework of encoder-decoder with attention for sequence labelling based spoken language understanding. We introduce Bidirectional Long Short Term Memory - Long Short Term Memory networks (BLSTM-LSTM) as the…

Computation and Language · Computer Science · 2017-03-14 · Su Zhu, Kai Yu

Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are…

Computation and Language · Computer Science · 2020-05-18 · Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model

Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single…

Audio and Speech Processing · Electrical Eng. & Systems · 2017-12-06 · Bo Li, Tara N. Sainath, Khe Chai Sim, Michiel Bacchiani, Eugene Weinstein, Patrick Nguyen, Zhifeng Chen, Yonghui Wu, Kanishka Rao

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder…

Sound · Computer Science · 2021-11-18 · Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

An online sequence-to-sequence model for noisy speech recognition

Generative models have long been the dominant approach for speech recognition. The success of these models, however, relies on the use of sophisticated recipes and complicated machinery that is not easily accessible to non-practitioners.…

Computation and Language · Computer Science · 2017-06-21 · Chung-Cheng Chiu, Dieterich Lawson, Yuping Luo, George Tucker, Kevin Swersky, Ilya Sutskever, Navdeep Jaitly

Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on…

Computation and Language · Computer Science · 2015-05-12 · Xiangang Li, Xihong Wu

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion,…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-03-18 · Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

Towards Relevance and Sequence Modeling in Language Recognition

The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-04-06 · Bharat Padi, Anand Mohan, Sriram Ganapathy

End-to-End Attention-based Large Vocabulary Speech Recognition

Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the…

Computation and Language · Computer Science · 2016-03-16 · Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio

Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input…

Computation and Language · Computer Science · 2025-09-30 · Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez

Analysis of memory in LSTM-RNNs for source separation

Long short-term memory recurrent neural networks (LSTM-RNNs) are considered state-of-the-art in many speech processing tasks. The recurrence in the network, in principle, allows any input to be remembered for an indefinite time, a feature…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-09-02 · Jeroen Zegers, Hugo Van hamme

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science · 2019-12-17 · Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

Streaming Speech-to-Confusion Network Speech Recognition

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-29 · Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens

Cascaded speech-to-speech translation systems often suffer from the error accumulation problem and high latency, which is a result of cascaded modules whose inference delays accumulate. In this paper, we propose a transducer-based speech…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-10-07 · Jinzheng Zhao, Niko Moritz, Egor Lakomkin, Ruiming Xie, Zhiping Xiu, Katerina Zmolikova, Zeeshan Ahmed, Yashesh Gaur, Duc Le, Christian Fuegen

A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention

This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a…

Machine Learning · Computer Science · 2025-04-22 · Tao Yang, Yu Cheng, Yaokun Ren, Yujia Lou, Minggu Wei, Honghui Xin

End-to-End Visual Speech Recognition for Small-Scale Datasets

Visual speech recognition models traditionally consist of two stages, feature extraction and classification. Several deep learning approaches have been recently presented aiming to replace the feature extraction stage by automatically…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-10 · Stavros Petridis, Yujiang Wang, Pingchuan Ma, Zuwei Li, Maja Pantic