Related papers: An Efficient Transformer Decoder with Compressed S…

Accelerating Neural Transformer via an Average Attention Network

With parallelizable attention networks, the neural Transformer is very fast to train. However, due to the auto-regressive architecture and self-attention in the decoder, the decoding procedure becomes slow. To alleviate this issue, we…

Computation and Language · Computer Science 2018-05-08 Biao Zhang, Deyi Xiong, Jinsong Su

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism.…

Computation and Language · Computer Science 2023-08-03 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to…

Neural and Evolutionary Computing · Computer Science 2019-11-07 Noam Shazeer

Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers

Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other…

Computation and Language · Computer Science 2023-05-17 Nurullah Sevim, Ege Ozan Özyedek, Furkan Şahinuç, Aykut Koç

Weighted Transformer Network for Machine Translation

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution…

Artificial Intelligence · Computer Science 2017-11-08 Karim Ahmed, Nitish Shirish Keskar, Richard Socher

Learning Hard Retrieval Decoder Attention for Transformers

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by…

Computation and Language · Computer Science 2021-09-13 Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong

Modeling Recurrence for Transformer

Recently, the Transformer model that is based solely on attention mechanisms, has advanced the state-of-the-art on various machine translation tasks. However, recent studies reveal that the lack of recurrence hinders its further improvement…

Computation and Language · Computer Science 2019-04-08 Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, Zhaopeng Tu

Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network

Due to the highly parallelizable architecture, Transformer is faster to train than RNN-based models and popularly used in machine translation tasks. However, at inference time, each output word requires all the hidden states of the…

Computation and Language · Computer Science 2019-09-06 Chengyi Wang, Shuangzhi Wu, Shujie Liu

Sharing Attention Weights for Fast Transformer

Recently, the Transformer machine translation system has shown strong results by stacking attention layers on both the source and target-language sides. But the inference of this model is slow due to the heavy use of dot-product attention…

Computation and Language · Computer Science 2019-06-27 Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, Tongran Liu

Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention…

Computation and Language · Computer Science 2018-11-02 Maha Elbayad, Laurent Besacier, Jakob Verbeek

Representational Strengths and Limitations of Transformers

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both…

Machine Learning · Computer Science 2023-11-17 Clayton Sanford, Daniel Hsu, Matus Telgarsky

DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Transformer-based QA models use input-wide self-attention -- i.e. across both the question and the input passage -- at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide…

Computation and Language · Computer Science 2020-05-05 Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian

ReduceFormer: Attention with Tensor Reduction by Summation

Transformers have excelled in many tasks including vision. However, efficient deployment of transformer models in low-latency or high-throughput applications is hindered by the computation in the attention mechanism which involves expensive…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 John Yang, Le An, Su Inn Park

Condenser: a Pre-training Architecture for Dense Retrieval

Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient…

Computation and Language · Computer Science 2021-09-22 Luyu Gao, Jamie Callan

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution…

Computer Vision and Pattern Recognition · Computer Science 2021-08-06 Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, Elisa Ricci

Transformers are efficient hierarchical chemical graph learners

Transformers, adapted from natural language processing, are emerging as a leading approach for graph representation learning. Contemporary graph transformers often treat nodes or edges as separate tokens. This approach leads to…

Machine Learning · Computer Science 2023-10-04 Zihan Pengmei, Zimu Li, Chih-chan Tien, Risi Kondor, Aaron R. Dinner

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However,…

Computation and Language · Computer Science 2023-10-20 Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers

Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the…

Computation and Language · Computer Science 2021-04-21 Hongfei Xu, Josef van Genabith, Qiuhui Liu, Deyi Xiong

Attention-Only Transformers via Unrolled Subspace Denoising

Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer…

Machine Learning · Computer Science 2025-06-05 Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma