Related papers: Multi-Stream Transformers

Rethinking Encoder-Decoder Flow Through Shared Structures

Dense prediction tasks have enjoyed a growing complexity of encoder architectures; decoders, however, have remained largely the same. They rely on individual blocks decoding intermediate feature maps sequentially. We introduce banks, shared…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-27 · Frederik Laboyrie, Mehmet Kerim Yucel, Albert Saa-Garriga

On the Sub-Layer Functionalities of Transformer Decoder

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During…

Computation and Language · Computer Science · 2020-10-07 · Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu

Balancing Cost and Benefit with Tied-Multi Transformers

We propose and evaluate a novel procedure for training multiple Transformers with tied parameters which compresses multiple models into one enabling the dynamic choice of the number of encoder and decoder layers during decoding. In…

Computation and Language · Computer Science · 2020-02-21 · Raj Dabre, Raphael Rubino, Atsushi Fujita

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet mostly unexplored for modern deep neural network end-to-end model…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-07-15 · Timo Lohrenz, Zhengyang Li, Tim Fingscheidt

Input Combination Strategies for Multi-Source Transformer Decoder

In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder…

Computation and Language · Computer Science · 2018-11-13 · Jindřich Libovický, Jindřich Helcl, David Mareček

Multi-Pass Transformer for Machine Translation

In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of…

Computation and Language · Computer Science · 2020-09-25 · Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers

Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the…

Computation and Language · Computer Science · 2021-04-21 · Hongfei Xu, Josef van Genabith, Qiuhui Liu, Deyi Xiong

End-to-End Multi-Channel Transformer for Speech Recognition

Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-02-09 · Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann

DecompX: Explaining Transformers Decisions by Propagating Token Decomposition

An emerging solution for explaining Transformer-based models is to use vector-based analysis on how the representations are formed. However, providing a faithful vector-based explanation for a multi-layer model could be challenging in three…

Computation and Language · Computer Science · 2023-06-06 · Ali Modarressi, Mohsen Fayyaz, Ehsan Aghazadeh, Yadollah Yaghoobzadeh, Mohammad Taher Pilehvar

Multi-Head Decoder for End-to-End Speech Recognition

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then,…

Computation and Language · Computer Science · 2018-07-31 · Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct…

Computation and Language · Computer Science · 2026-03-10 · J. Clayton Kerce, Alexis Fox

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue to…

Computation and Language · Computer Science · 2021-03-02 · Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-28 · Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

Generating Diverse Translation by Manipulating Multi-Head Attention

Transformer model has been widely used on machine translation tasks and obtained state-of-the-art results. In this paper, we report an interesting phenomenon in its encoder-decoder multi-head attention: different attention heads of the…

Computation and Language · Computer Science · 2019-11-22 · Zewei Sun, Shujian Huang, Hao-Ran Wei, Xin-yu Dai, Jiajun Chen

Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-09-24 · Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel

Decoding in a Transformer based language model is inherently sequential as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new…

Machine Learning · Computer Science · 2025-08-27 · Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio…

Sound · Computer Science · 2020-10-08 · Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, Hasim Sak

Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific…

Computation and Language · Computer Science · 2020-04-15 · Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science · 2025-06-03 · Yifan Hao, Chenlu Ye, Chi Han, Tong Zhang

Structured Multidimensional Representation Learning for Large Language Models

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the…

Computation and Language · Computer Science · 2026-03-09 · Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois