Related papers: Intermediate-layer output Regularization for Atten…

R-BI: Regularized Batched Inputs enhance Incremental Decoding Framework for Low-Latency Simultaneous Speech Translation

Incremental Decoding is an effective framework that enables the use of an offline model in a simultaneous setting without modifying the original model, making it suitable for Low-Latency Simultaneous Speech Translation. However, this…

Computation and Language · Computer Science 2024-01-12 Jiaxin Guo, Zhanglin Wu, Zongyao Li, Hengchao Shang, Daimeng Wei, Xiaoyu Chen, Zhiqiang Rao, Shaojun Li, Hao Yang

Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that…

Computation and Language · Computer Science 2024-01-23 Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

Intermediate Layer Optimization for Inverse Problems using Deep Generative Models

We propose Intermediate Layer Optimization (ILO), a novel optimization algorithm for solving inverse problems with deep generative models. Instead of optimizing only over the initial latent code, we progressively change the input layer…

Machine Learning · Computer Science 2021-02-16 Giannis Daras, Joseph Dean, Ajil Jalal, Alexandros G. Dimakis

Regularized Forward-Backward Decoder for Attention Models

Nowadays, attention models are one of the popular candidates for speech recognition. So far, many studies mainly focus on the encoder structure or the attention module to enhance the performance of these models. However, they mostly ignore the…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-29 Tobias Watzel, Ludwig Kürzinger, Lujun Li, Gerhard Rigoll

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science 2019-12-17 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

This paper proposes a simple yet effective way of regularising the encoder-decoder-based automatic speech recognition (ASR) models that enhance the robustness of the model and improve the generalisation to out-of-domain scenarios. The…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-24 Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký

Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization

End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance.…

Computation and Language · Computer Science 2025-07-01 Duygu Altinok

Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR

An end-to-end (E2E) ASR model implicitly learns a prior Internal Language Model (ILM) from the training transcripts. To fuse an external LM using Bayes posterior theory, the log likelihood produced by the ILM has to be accurately estimated…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-03 Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang

Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition

This paper proposes a self-regularised minimum latency training (SR-MLT) method for streaming Transformer-based automatic speech recognition (ASR) systems. In previous works, latency was optimised by truncating the online attention weights…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-25 Mohan Li, Rama Doddipatla, Catalin Zorila

Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A…

Computation and Language · Computer Science 2021-06-18 Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization

Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and…

Machine Learning · Computer Science 2025-05-23 Peng Lu, Ahmad Rashid, Ivan Kobyzev, Mehdi Rezagholizadeh, Philippe Langlais

Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last…

Computation and Language · Computer Science 2022-08-30 Fenglin Liu, Xuancheng Ren, Guangxiang Zhao, Chenyu You, Xuewei Ma, Xian Wu, Xu Sun

Serialized Output Training for End-to-End Overlapped Speech Recognition

This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers as with the permutation…

Computation and Language · Computer Science 2020-08-11 Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

Self-Supervised Learning for speech recognition with Intermediate layer supervision

Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-17 Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

End-to-end training of deep learning-based models allows for implicit learning of intermediate representations based on the final task loss. However, the end-to-end approach ignores the useful domain knowledge encoded in explicit…

Computation and Language · Computer Science 2017-04-20 Shubham Toshniwal, Hao Tang, Liang Lu, Karen Livescu

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming…

Computation and Language · Computer Science 2023-10-03 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur

Advancing Multi-talker ASR Performance with Large Language Models

Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-02 Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

Multitask Training with Text Data for End-to-End Speech Recognition

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data. Trained on the…

Computation and Language · Computer Science 2021-06-15 Peidong Wang, Tara N. Sainath, Ron J. Weiss

Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation

Predicting the altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we proposed to introduce two…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-12 Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao

Layer-Wise Multi-View Learning for Neural Machine Translation

Traditional neural machine translation is limited to the topmost encoder layer's context representation and cannot directly perceive the lower encoder layers. Existing solutions usually rely on the adjustment of network architecture, making…

Computation and Language · Computer Science 2020-11-04 Qiang Wang, Changliang Li, Yue Zhang, Tong Xiao, Jingbo Zhu