Related papers: LegoNN: Building Modular Encoder-Decoder Models

Independent language modeling architecture for end-to-end ASR

The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network…

Computation and Language · Computer Science 2019-12-03 Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources,…

Computation and Language · Computer Science 2015-10-20 Yajie Miao, Mohammad Gowayyed, Florian Metze

Lego-Features: Exporting modular encoder features for streaming and deliberation ASR

In end-to-end (E2E) speech recognition models, a representational tight-coupling inevitably emerges between the encoder and the decoder. We build upon recent work that has begun to explore building encoders with modular encoded…

Computation and Language · Computer Science 2023-04-04 Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays

LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To…

Computation and Language · Computer Science 2025-05-19 Rao Ma, Tongzhou Chen, Kartik Audhkhasi, Bhuvana Ramabhadran

Attention guided global enhancement and local refinement network for semantic segmentation

The encoder-decoder architecture is widely used as a lightweight semantic segmentation network. However, it struggles with a limited performance compared to a well-designed Dilated-FCN model for two major problems. First, commonly used…

Computer Vision and Pattern Recognition · Computer Science 2022-05-11 Jiangyun Li, Sen Zha, Chen Chen, Meng Ding, Tianxiang Zhang, Hong Yu

Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

While large language models are primarily used on natural language tasks, they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Most proposed approaches for such cross-modal…

Machine Learning · Computer Science 2026-03-09 Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam

Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their…

Computation and Language · Computer Science 2022-03-01 Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

Echo State Speech Recognition

We propose automatic speech recognition (ASR) models inspired by echo state network (ESN), in which a subset of recurrent neural networks (RNN) layers in the models are randomly initialized and untrained. Our study focuses on RNN-T and…

Computation and Language · Computer Science 2021-02-19 Harsh Shrivastava, Ankush Garg, Yuan Cao, Yu Zhang, Tara Sainath

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes…

Computation and Language · Computer Science 2014-09-04 Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

Improving Variational Encoder-Decoders in Dialogue Generation

Variational encoder-decoders (VEDs) have shown promising results in dialogue generation. However, the latent variable distributions are usually approximated by a much simpler model than the powerful RNN structure used for encoding and…

Computation and Language · Computer Science 2018-02-07 Xiaoyu Shen, Hui Su, Shuzi Niu, Vera Demberg

Local Monotonic Attention Mechanism for End-to-End Speech and Language Processing

Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target…

Computation and Language · Computer Science 2017-11-06 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang, Shuo Ren, Yao Qian, Shujie Liu, Yu Shi, Yanmin Qian, Michael Zeng

Natural Language Generation for Spoken Dialogue System using RNN Encoder-Decoder Networks

Natural language generation (NLG) is a critical component in a spoken dialogue system. This paper presents a Recurrent Neural Network based Encoder-Decoder architecture, in which an LSTM-based decoder is introduced to select, aggregate…

Computation and Language · Computer Science 2017-08-16 Van-Khanh Tran, Le-Minh Nguyen

Coupling Speech Encoders with Downstream Text Models

We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and…

Computation and Language · Computer Science 2024-07-26 Ciprian Chelba, Johan Schalkwyk

Return of the Encoder: Maximizing Parameter Efficiency for SLMs

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters…

Computation and Language · Computer Science 2025-01-31 Mohamed Elfeki, Rui Liu, Chad Voegele

Decoupled Structure for Improved Adaptability of End-to-End Models

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-28 Keqi Deng, Philip C. Woodland

Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders

We show that a Modular Neural Network (MNN) can combine various speech enhancement modules, each of which is a Deep Neural Network (DNN) specialized on a particular enhancement job. Differently from an ordinary ensemble technique that…

Sound · Computer Science 2017-05-31 Minje Kim

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

Language Models are Good Translators

Recent years have witnessed the rapid advance in neural machine translation (NMT), the core of which lies in the encoder-decoder architecture. Inspired by the recent progress of large-scale pre-trained language models on machine translation…

Computation and Language · Computer Science 2021-06-28 Shuo Wang, Zhaopeng Tu, Zhixing Tan, Wenxuan Wang, Maosong Sun, Yang Liu

Is Encoder-Decoder Redundant for Neural Machine Translation?

Encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of…

Computation and Language · Computer Science 2022-10-24 Yingbo Gao, Christian Herold, Zijian Yang, Hermann Ney