English
Related papers

Related papers: LegoNN: Building Modular Encoder-Decoder Models

200 papers

The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network…

Computation and Language · Computer Science 2019-12-03 Van Tung Pham , Haihua Xu , Yerbolat Khassanov , Zhiping Zeng , Eng Siong Chng , Chongjia Ni , Bin Ma , Haizhou Li

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources,…

Computation and Language · Computer Science 2015-10-20 Yajie Miao , Mohammad Gowayyed , Florian Metze

In end-to-end (E2E) speech recognition models, a representational tight-coupling inevitably emerges between the encoder and the decoder. We build upon recent work that has begun to explore building encoders with modular encoded…

Computation and Language · Computer Science 2023-04-04 Rami Botros , Rohit Prabhavalkar , Johan Schalkwyk , Ciprian Chelba , Tara N. Sainath , Françoise Beaufays

Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To…

Computation and Language · Computer Science 2025-05-19 Rao Ma , Tongzhou Chen , Kartik Audhkhasi , Bhuvana Ramabhadran

The encoder-decoder architecture is widely used as a lightweight semantic segmentation network. However, it struggles with a limited performance compared to a well-designed Dilated-FCN model for two major problems. First, commonly used…

Computer Vision and Pattern Recognition · Computer Science 2022-05-11 Jiangyun Li , Sen Zha , Chen Chen , Meng Ding , Tianxiang Zhang , Hong Yu

While large language models are primarily used on natural language tasks, they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Most proposed approaches for such cross-modal…

Machine Learning · Computer Science 2026-03-09 Paloma García-de-Herreros , Philipp Slusallek , Dietrich Klakow , Vagrant Gautam

Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR system are harder to customize due to their…

Computation and Language · Computer Science 2022-03-01 Samuel Thomas , Brian Kingsbury , George Saon , Hong-Kwang J. Kuo

We propose automatic speech recognition (ASR) models inspired by echo state network (ESN), in which a subset of recurrent neural networks (RNN) layers in the models are randomly initialized and untrained. Our study focuses on RNN-T and…

Computation and Language · Computer Science 2021-02-19 Harsh Shrivastava , Ankush Garg , Yuan Cao , Yu Zhang , Tara Sainath

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes…

Computation and Language · Computer Science 2014-09-04 Kyunghyun Cho , Bart van Merrienboer , Caglar Gulcehre , Dzmitry Bahdanau , Fethi Bougares , Holger Schwenk , Yoshua Bengio

Variational encoder-decoders (VEDs) have shown promising results in dialogue generation. However, the latent variable distributions are usually approximated by a much simpler model than the powerful RNN structure used for encoding and…

Computation and Language · Computer Science 2018-02-07 Xiaoyu Shen , Hui Su , Shuzi Niu , Vera Demberg

Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target…

Computation and Language · Computer Science 2017-11-06 Andros Tjandra , Sakriani Sakti , Satoshi Nakamura

The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text…

Sound · Computer Science 2021-10-26 Wei Wang , Shuo Ren , Yao Qian , Shujie Liu , Yu Shi , Yanmin Qian , Michael Zeng

Natural language generation (NLG) is a critical component in a spoken dialogue system. This paper presents a Recurrent Neural Network based Encoder-Decoder architecture, in which an LSTM-based decoder is introduced to select, aggregate…

Computation and Language · Computer Science 2017-08-16 Van-Khanh Tran , Le-Minh Nguyen

We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and…

Computation and Language · Computer Science 2024-07-26 Ciprian Chelba , Johan Schalkwyk

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters…

Computation and Language · Computer Science 2025-01-31 Mohamed Elfeki , Rui Liu , Chad Voegele

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-28 Keqi Deng , Philip C. Woodland

We show that a Modular Neural Network (MNN) can combine various speech enhancement modules, each of which is a Deep Neural Network (DNN) specialized on a particular enhancement job. Differently from an ordinary ensemble technique that…

Sound · Computer Science 2017-05-31 Minje Kim

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin , Sangyoun Lee , Taehan Kim , Hyung-Min Park

Recent years have witnessed the rapid advance in neural machine translation (NMT), the core of which lies in the encoder-decoder architecture. Inspired by the recent progress of large-scale pre-trained language models on machine translation…

Computation and Language · Computer Science 2021-06-28 Shuo Wang , Zhaopeng Tu , Zhixing Tan , Wenxuan Wang , Maosong Sun , Yang Liu

Encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of…

Computation and Language · Computer Science 2022-10-24 Yingbo Gao , Christian Herold , Zijian Yang , Hermann Ney
‹ Prev 1 2 3 10 Next ›