Related papers: Chunked Attention-based Encoder-Decoder Model for …

Synchronous Transformers for End-to-End Speech Recognition

For most of the attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. The asynchronous problem between the encoding and decoding makes these…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Currently, there are mainly three kinds of Transformer encoder based streaming End to End (E2E) Automatic Speech Recognition (ASR) approaches, namely time-restricted methods, chunk-wise methods, and memory-based methods. Generally, all of…

Sound · Computer Science 2022-09-27 Fangyuan Wang, Bo Xu

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-18 Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the…

Sound · Computer Science 2021-12-30 Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-31 Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek

One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition

The RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-06 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Streaming End-to-end Speech Recognition For Mobile Devices

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models…

Computation and Language · Computer Science 2018-11-19 Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-yiin Chang, Kanishka Rao, Alexander Gruenstein

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability…

Machine Learning · Computer Science 2026-03-02 Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to…

Computation and Language · Computer Science 2023-05-08 Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello, Juan Pino

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need of…

Computation and Language · Computer Science 2020-11-17 Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence and then generate an output sequence based on the input encoding. Both are interfaced with an attention…

Computation and Language · Computer Science 2018-11-02 Maha Elbayad, Laurent Besacier, Jakob Verbeek

Multi-Head Decoder for End-to-End Speech Recognition

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then,…

Computation and Language · Computer Science 2018-07-31 Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Recently, the Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-12 Haoran Miao, Gaofeng Cheng, Changfeng Gao, Pengyuan Zhang, Yonghong Yan

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-19 Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

Encoder-decoder models have become an effective approach for sequence learning tasks like machine translation, image captioning and speech recognition, but have yet to show competitive results for handwritten text recognition. To this end,…

Computer Vision and Pattern Recognition · Computer Science 2019-07-16 Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet mostly unexplored for modern deep neural network end-to-end model…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-15 Timo Lohrenz, Zhengyang Li, Tim Fingscheidt

Towards Online End-to-end Transformer Automatic Speech Recognition

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction

This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND) method, which detects speaker activities in a frame-in-frame-out fashion. The proposed model mainly consists of a causal embedding encoder and an online…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-09 Di Liang, Xiaofei Li

DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams

Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for high performance, low-latency inference on devices with limited…

Machine Learning · Computer Science 2026-04-01 Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis

Self-Attention Transducers for End-to-End Speech Recognition

Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes parallelization difficult. In this paper, we propose a self-attention transducer…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengqi Wen