Related papers: Transformer with Bidirectional Decoder for Speech …

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et…

Computation and Language · Computer Science 2020-11-21 Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition

Non-autoregressive (NAR) transformer models have been studied intensively in automatic speech recognition (ASR), and a substantial part of NAR transformer models is to use the causal mask to limit token dependencies. However, the causal…

Computation and Language · Computer Science 2021-09-15 Chuan-Fei Zhang, Yan Liu, Tian-Hao Zhang, Song-Lu Chen, Feng Chen, Xu-Cheng Yin

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual…

Computation and Language · Computer Science 2021-04-20 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science 2019-12-17 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the…

Sound · Computer Science 2022-07-05 Kun Wei, Pengcheng Guo, Ning Jiang

Efficient Bidirectional Neural Machine Translation

The encoder-decoder based neural machine translation usually generates a target sequence token by token from left to right. Due to error propagation, the tokens on the right side of the generated sequence are usually of poorer quality than…

Computation and Language · Computer Science 2019-08-27 Xu Tan, Yingce Xia, Lijun Wu, Tao Qin

Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation

Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-12 Changhan Wang, Juan Pino, Jiatao Gu

Sequence Generation: From Both Sides to the Middle

The encoder-decoder framework has achieved promising progress for many sequence generation tasks, such as neural machine translation and text summarization. Such a framework usually generates a sequence token by token from left to right,…

Computation and Language · Computer Science 2019-06-25 Long Zhou, Jiajun Zhang, Chengqing Zong, Heng Yu

Synchronous Bidirectional Inference for Neural Sequence Generation

In sequence to sequence generation tasks (e.g. machine translation and abstractive summarization), inference is generally performed in a left-to-right manner to produce the result token by token. The neural approaches, such as LSTM and…

Computation and Language · Computer Science 2019-02-26 Jiajun Zhang, Long Zhou, Yang Zhao, Chengqing Zong

Synchronous Bidirectional Neural Machine Translation

Existing approaches to neural machine translation (NMT) generate the target language sequence token by token from left to right. However, this kind of unidirectional decoding framework cannot make full use of the target-side future contexts…

Computation and Language · Computer Science 2019-05-14 Long Zhou, Jiajun Zhang, Chengqing Zong

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

We propose cross-modal transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder…

Computation and Language · Computer Science 2021-07-06 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

Non-autoregressive Transformer-based End-to-end ASR using BERT

Transformer-based models have led to significant innovation in classical and practical subjects as varied as speech processing, natural language processing, and computer vision. On top of the Transformer, attention-based end-to-end…

Computation and Language · Computer Science 2022-05-19 Fu-Hao Yu, Kuan-Yu Chen

Streaming automatic speech recognition with the transformer model

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context…

Sound · Computer Science 2020-07-02 Niko Moritz, Takaaki Hori, Jonathan Le Roux

Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). The speech-to-text alignment is a problem of splitting long audio recordings with un-aligned transcripts into…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-22 Yusuke Kida, Tatsuya Komatsu, Masahito Togami

Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly.…

Computation and Language · Computer Science 2022-07-05 Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc. In this paper, we propose a transformer-based…

Sound · Computer Science 2024-09-04 Tathagata Bandyopadhyay

Asynchronous Bidirectional Decoding for Neural Machine Translation

The dominant neural machine translation (NMT) models apply unified attentional encoder-decoder neural networks for translation. Traditionally, the NMT decoders adopt recurrent neural networks (RNNs) to perform translation in a left-to-right…

Computation and Language · Computer Science 2018-02-06 Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, Hongji Wang

Insertion-Based Modeling for End-to-End Automatic Speech Recognition

End-to-end (E2E) models have gained attention in the research field of automatic speech recognition (ASR). Many E2E models proposed so far assume left-to-right autoregressive generation of an output token sequence except for connectionist…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-17 Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang

Transformer ASR with Contextual Block Processing

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-17 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

Text-Conditioned Transformer for Automatic Pronunciation Error Detection

Automatic pronunciation error detection (APED) plays an important role in the domain of language learning. As for the previous ASR-based APED methods, the decoded results need to be aligned with the target text so that the errors can be…

Audio and Speech Processing · Electrical Eng. & Systems 2021-05-06 Zhan Zhang, Yuehai Wang, Jianyi Yang