Related papers: Streaming Audio-Visual Speech Recognition with Ali…

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered…

Sound · Computer Science 2021-10-22 Huaibo Zhao, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-24 Hirofumi Inaguma, Tatsuya Kawahara

Audio-Visual Efficient Conformer for Robust Speech Recognition

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train…

Computer Vision and Pattern Recognition · Computer Science 2023-01-05 Maxime Burchi, Radu Timofte

Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

Recently, there has been an increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment cost. The best-known approaches rely on either window-based or dynamic…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-27 Xilai Li, Goeric Huybrechts, Srikanth Ronanki, Jeff Farris, Sravan Bodapati

VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-16 Hirofumi Inaguma, Tatsuya Kawahara

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional methods, such as look-ahead and chunk-based methods, usually require information from future frames to advance…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-07 Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan

Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is…

Computation and Language · Computer Science 2017-06-12 Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-24 Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist…

Audio and Speech Processing · Electrical Eng. & Systems 2023-07-06 Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Streaming automatic speech recognition with the transformer model

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context…

Sound · Computer Science 2020-07-02 Niko Moritz, Takaaki Hori, Jonathan Le Roux

An improved hybrid CTC-Attention model for speech recognition

Recently, end-to-end speech recognition with a hybrid model consisting of the connectionist temporal classification (CTC) and the attention encoder-decoder achieved state-of-the-art results. In this paper, we propose a novel CTC decoder…

Sound · Computer Science 2018-11-02 Zhe Yuan, Zhuoran Lyu, Jiwei Li, Xi Zhou

Multi-Stream End-to-End Speech Recognition

Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-22 Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the…

Sound · Computer Science 2021-12-30 Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei

Stream attention-based multi-array end-to-end speech recognition

Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by…

Computation and Language · Computer Science 2019-02-20 Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

Contextual Biasing for Streaming ASR via CTC-based Word Spotting

Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-20 Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun, Berlin Chen

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-22 Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg

Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency

Connectionist Temporal Classification (CTC), a non-autoregressive training criterion, is widely used in online keyword spotting (KWS). However, existing CTC-based KWS decoding strategies either rely on Automatic Speech Recognition (ASR),…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-25 Yu Xi, Haoyu Li, Xiaoyu Gu, Hao Li, Yidi Jiang, Kai Yu

Contextual-Utterance Training for Automatic Speech Recognition

Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-31 Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan, Simon Wiesler

Transformer-based Streaming ASR with Cumulative Attention

In this paper, we propose an online attention mechanism, known as cumulative attention (CA), for streaming Transformer-based automatic speech recognition (ASR). Inspired by monotonic chunkwise attention (MoChA) and head-synchronous…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-14 Mohan Li, Shucong Zhang, Catalin Zorila, Rama Doddipatla

Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their…

Computation and Language · Computer Science 2021-04-30 Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao