Related papers: Chunked Attention-based Encoder-Decoder Model for …

200 papers

For most of the attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. The asynchronous problem between the encoding and decoding makes these…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Currently, there are mainly three kinds of Transformer encoder based streaming End to End (E2E) Automatic Speech Recognition (ASR) approaches, namely time-restricted methods, chunk-wise methods, and memory-based methods. Generally, all of…

Sound · Computer Science 2022-09-27 Fangyuan Wang, Bo Xu

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-18 Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the…

Sound · Computer Science 2021-12-30 Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei

Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-31 Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek

The RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-06 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models…

We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability…

Machine Learning · Computer Science 2026-03-02 Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam

Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to…

Computation and Language · Computer Science 2023-05-08 Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello, Juan Pino

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need of…

Computation and Language · Computer Science 2020-11-17 Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention…

Computation and Language · Computer Science 2018-11-02 Maha Elbayad, Laurent Besacier, Jakob Verbeek

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then,…

Computation and Language · Computer Science 2018-07-31 Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-12 Haoran Miao, Gaofeng Cheng, Changfeng Gao, Pengyuan Zhang, Yonghong Yan

This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed…

Encoder-decoder models have become an effective approach for sequence learning tasks like machine translation, image captioning and speech recognition, but have yet to show competitive results for handwritten text recognition. To this end,…

Computer Vision and Pattern Recognition · Computer Science 2019-07-16 Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner

Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet mostly unexplored for modern deep neural network end-to-end model…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-15 Timo Lohrenz, Zhengyang Li, Tim Fingscheidt

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND) method, which detects speaker activities in a frame-in-frame-out fashion. The proposed model mainly consists of a causal embedding encoder and an online…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-09 Di Liang, Xiaofei Li

Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for high performance, low-latency inference on devices with limited…

Machine Learning · Computer Science 2026-04-01 Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis

Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes it difficult for parallelization. In this paper, we propose a self-attention transducer…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengqi Wen