Related papers: An Asynchronous WFST-Based Decoder For Automatic S…

Continual Speech Learning with Fused Speech Features

Rapid growth in speech data demands adaptive models, as traditional static methods fail to keep pace with dynamic and diverse speech information. We introduce continuous speech learning, a new set-up aimed at bridging the adaptation gap…

Computation and Language · Computer Science 2025-06-04 Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari

Efficient Dynamic WFST Decoding for Personalized Language Models

We propose a two-layer cache mechanism to speed up dynamic WFST decoding with personalized language models. The first layer is a public cache that stores most of the static part of the graph. This is shared globally among all users. A…

Computation and Language · Computer Science 2019-10-24 Jun Liu, Jiedan Zhu, Vishal Kathuria, Fuchun Peng

Dynamic latency speech recognition with asynchronous revision

In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Specifically, we achieve dynamic latency with only one model by using arbitrary right context during…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-04 Mingkun Huang, Meng Cai, Jun Zhang, Yang Zhang, Yongbin You, Yi He, Zejun Ma

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-28 Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts…

Computer Vision and Pattern Recognition · Computer Science 2020-07-20 Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, Wayne Zhang

Synchronous Transformers for End-to-End Speech Recognition

For most of the attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. The asynchronous problem between the encoding and decoding makes these…

Audio and Speech Processing · Electrical Eng. & Systems 2020-02-25 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et…

Computation and Language · Computer Science 2020-11-21 Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition

End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language…

Sound · Computer Science 2026-01-05 Zhuoran Zhuang, Ye Chen, Chao Luo, Tian-Hao Zhang, Xuewei Zhang, Jian Ma, Jiatong Shi, Wei Zhang

First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs

We present a method to perform first-pass large vocabulary continuous speech recognition using only a neural network and language model. Deep neural network acoustic models are now commonplace in HMM-based speech recognition systems, but…

Computation and Language · Computer Science 2014-12-09 Awni Y. Hannun, Andrew L. Maas, Daniel Jurafsky, Andrew Y. Ng

Parallel Composition of Weighted Finite-State Transducers

Finite-state transducers (FSTs) are frequently used in speech recognition. Transducer composition is an essential operation for combining different sources of information at different granularities. However, composition is also one of the…

Computation and Language · Computer Science 2021-10-07 Shubho Sengupta, Vineel Pratap, Awni Hannun

A review of on-device fully neural end-to-end automatic speech recognition algorithms

In this paper, we review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications. Conventional speech recognition systems comprise a large number of discrete components such as…

Machine Learning · Computer Science 2021-08-30 Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han

DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion

Current Audio-Visual Source Separation methods primarily adopt two design strategies. The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the…

Sound · Computer Science 2025-05-01 Yinfeng Yu, Shiyu Sun

AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in…

Computation and Language · Computer Science 2025-03-19 Wuwei Huang, Dexin Wang, Deyi Xiong

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. With the objective of enhancing the conformer architecture for efficient training and inference, we carefully redesigned Conformer with a…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-03 Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal…

Computation and Language · Computer Science 2023-10-12 Qingkai Fang, Yan Zhou, Yang Feng

Syntactically Supervised Transformers for Faster Neural Machine Translation

Standard decoders for neural machine translation autoregressively generate a single target token per time step, which slows inference especially for long outputs. While architectural advances such as the Transformer fully parallelize the…

Computation and Language · Computer Science 2020-10-06 Nader Akoury, Kalpesh Krishna, Mohit Iyyer

LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition

LSTM language models (LSTM-LMs) have been proven to be powerful and yielded significant performance improvements over count based n-gram LMs in modern speech recognition systems. Due to its infinite history states and computational load,…

Computation and Language · Computer Science 2020-10-23 Xie Chen, Sarangarajan Parthasarathy, William Gale, Shuangyu Chang, Michael Zeng

Speech Recognition Front End Without Information Loss

Speech representation and modelling in high-dimensional spaces of acoustic waveforms, or a linear transformation thereof, is investigated with the aim of improving the robustness of automatic speech recognition to additive noise. The…

Computation and Language · Computer Science 2015-03-31 Matthew Ager, Zoran Cvetkovic, Peter Sollich