Related papers: Quantifying Attention Flow in Transformers

On Identifiability in Transformers

In this paper we delve deep into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and…

Computation and Language · Computer Science 2020-02-10 Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer

Measuring the Mixing of Contextual Information in the Transformer

The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that…

Computation and Language · Computer Science 2022-10-25 Javier Ferrando, Gerard I. Gállego, Marta R. Costa-jussà

Flowformer: Linearizing Transformers with Conservation Flows

Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling…

Machine Learning · Computer Science 2022-06-17 Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions…

Computation and Language · Computer Science 2021-02-26 Yaru Hao, Li Dong, Furu Wei, Ke Xu

Attention Flows for General Transformers

In this paper, we study the computation of how much an input token in a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend…

Machine Learning · Computer Science 2022-06-01 Niklas Metzger, Christopher Hahn, Julian Siber, Frederik Schmitt, Bernd Finkbeiner

Horizontal and Vertical Attention in Transformers

Transformers are built upon multi-head scaled dot-product attention and positional encoding, which aim to learn the feature representations and token dependencies. In this work, we focus on enhancing the distinctive representation by…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Litao Yu, Jian Zhang

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows…

Machine Learning · Computer Science 2025-07-01 Venmugil Elango

AttentionViz: A Global View of Transformer Attention

Transformer models are revolutionizing machine learning, but their inner workings remain mysterious. In this work, we present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers…

Human-Computer Interaction · Computer Science 2023-08-10 Catherine Yeh, Yida Chen, Aoyu Wu, Cynthia Chen, Fernanda Viégas, Martin Wattenberg

A Multiscale Visualization of Attention in the Transformer Model

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model…

Human-Computer Interaction · Computer Science 2019-06-14 Jesse Vig

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

Linear Log-Normal Attention with Unbiased Concentration

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan, Joseph Kampeas, Emir Haleva

Evolving Attention with Residual Convolutions

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention…

Computer Vision and Pattern Recognition · Computer Science 2021-03-30 Hila Chefer, Shir Gur, Lior Wolf

Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models

Transformer models typically calculate attention matrices using dot products, which have limitations when capturing nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces dot products with…

Machine Learning · Computer Science 2025-11-10 Andrew DiGiugno, Ausif Mahmood

Centroid Transformers: Learning to Abstract with Attention

Self-attention, as the key block of transformers, is a powerful mechanism for extracting features from the inputs. In essence, what self-attention does is to infer the pairwise relations between the elements of the inputs, and modify the…

Machine Learning · Computer Science 2021-03-09 Lemeng Wu, Xingchao Liu, Qiang Liu

Unveiling and Controlling Anomalous Attention Distribution in Transformers

With the advent of large models based on the Transformer architecture, researchers have observed an anomalous phenomenon in the Attention mechanism--there is a very high attention on the first element, which is prevalent across…

Machine Learning · Computer Science 2024-07-04 Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao

Understanding Transformers and Attention Mechanisms: An Introduction for Applied Mathematicians

This document provides a brief introduction to the attention mechanism used in modern language models based on the Transformer architecture. We first illustrate how text is encoded as vectors and how the attention mechanism processes these…

Numerical Analysis · Mathematics 2026-04-02 Michel Fabrice Serret

Poly-attention: a general scheme for higher-order self-attention

The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting…

Machine Learning · Computer Science 2026-02-03 Sayak Chakrabarti, Toniann Pitassi, Josh Alman

Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science 2022-06-02 Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions

This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment…

Computation and Language · Computer Science 2021-09-14 Javier Ferrando, Marta R. Costa-jussà