Related papers: Pay Attention when Required

Transformer++

Recent advancements in attention mechanisms have replaced recurrent neural networks and their variants for machine translation tasks. The Transformer, using the attention mechanism alone, achieved state-of-the-art results in sequence modeling. Neural…

Computation and Language · Computer Science 2020-04-02 Prakhar Thapak, Prodip Hore

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a…

Machine Learning · Computer Science 2019-06-04 Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

A Multiscale Visualization of Attention in the Transformer Model

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model…

Human-Computer Interaction · Computer Science 2019-06-14 Jesse Vig

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Attention based language models have become a critical component in state-of-the-art natural language processing systems. However, these models have significant computational requirements, due to long training times, dense operations and…

Computation and Language · Computer Science 2021-06-11 Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures

Transformer-based language models have recently been at the forefront of active research in text generation. However, these models' advances come at the price of prohibitive training costs, with parameter counts in the billions and compute…

Computation and Language · Computer Science 2025-02-04 Gabriel Lindenmaier, Sean Papay, Sebastian Padó

Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps

Transformers are ubiquitous across a wide range of tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter counts.…

Computation and Language · Computer Science 2024-04-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

While Transformer-based models have shown impressive language modeling performance, the large computation cost is often prohibitive for practical use. Attention head pruning, which removes unnecessary attention heads in the multihead…

Computation and Language · Computer Science 2021-10-08 Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi

Fastformer: Additive Attention Can Be All You Need

The Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity with respect to input sequence length. Although there are many methods for Transformer acceleration, they are still either inefficient on…

Computation and Language · Computer Science 2021-09-07 Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, Xing Xie

Temporal Attention for Language Models

Pretrained language models based on the transformer architecture have shown great success in NLP. Textual training data often comes from the web and is thus tagged with time-specific information, but most language models ignore this…

Computation and Language · Computer Science 2022-05-05 Guy D. Rosin, Kira Radinsky

Augmenting Self-attention with Persistent Memory

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long…

Machine Learning · Computer Science 2019-07-03 Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

Does Self-Attention Need Separate Weights in Transformers?

The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent…

Computation and Language · Computer Science 2025-05-05 Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions…

Computation and Language · Computer Science 2021-02-26 Yaru Hao, Li Dong, Furu Wei, Ke Xu

Block Transformer: Global-to-Local Language Modeling for Fast Inference

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

Efficient Content-Based Sparse Attention with Routing Transformers

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches…

Machine Learning · Computer Science 2020-10-27 Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

Transition-based Parsing with Stack-Transformers

Modeling the parser state is key to good performance in transition-based parsing. Recurrent Neural Networks considerably improved the performance of transition-based systems by modelling the global state, e.g. stack-LSTM parsers, or local…

Computation and Language · Computer Science 2020-10-22 Ramon Fernandez Astudillo, Miguel Ballesteros, Tahira Naseem, Austin Blodgett, Radu Florian

A Transformer with Stack Attention

Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in…

Computation and Language · Computer Science 2024-05-15 Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Input-length-shortening and text generation via attention values

Identifying words that impact a task's performance more than others is a challenge in natural language processing. Transformer models have recently addressed this issue by incorporating an attention mechanism that assigns greater attention…

Computation and Language · Computer Science 2023-03-15 Neşet Özkan Tan, Alex Yuxuan Peng, Joshua Bensemann, Qiming Bao, Tim Hartill, Mark Gahegan, Michael Witbrock

How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as…

Computation and Language · Computer Science 2022-11-08 Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz

Horizontal and Vertical Attention in Transformers

Transformers are built upon multi-head scaled dot-product attention and positional encoding, which aim to learn the feature representations and token dependencies. In this work, we focus on enhancing the distinctive representation by…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Litao Yu, Jian Zhang