Related papers: Flowformer: Linearizing Transformers with Conserva…

Quantifying Attention Flow in Transformers

In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens…

Machine Learning · Computer Science 2020-06-02 Samira Abnar, Willem Zuidema

Linear Log-Normal Attention with Unbiased Concentration

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan, Joseph Kampeas, Emir Haleva

Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Transformer Meets Twicing: Harnessing Unattended Residual Information

Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data…

Machine Learning · Computer Science 2025-08-05 Laziz Abdullaev, Tan M. Nguyen

FLatten Transformer: Vision Transformer using Focused Linear Attention

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Vision Transformers are Circulant Attention Learners

The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application.…

Computer Vision and Pattern Recognition · Computer Science 2025-12-29 Dongchen Han, Tianyu Li, Ziyi Wang, Gao Huang

Horizontal and Vertical Attention in Transformers

Transformers are built upon multi-head scaled dot-product attention and positional encoding, which aim to learn the feature representations and token dependencies. In this work, we focus on enhancing the distinctive representation by…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Litao Yu, Jian Zhang

Evolving Attention with Residual Convolutions

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

On Difficulties of Attention Factorization through Shared Memory

Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex…

Machine Learning · Computer Science 2024-04-02 Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for language data, they quickly found their way to image, video, graph, and other data modalities with various…

Machine Learning · Computer Science 2025-09-22 Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

FAST: Factorizable Attention for Speeding up Transformers

Motivated by the factorization inherent in the original fast multipole method and the improved fast Gauss transform we introduce a factorable form of attention that operates efficiently in high dimensions. This approach reduces the…

Machine Learning · Computer Science 2024-02-13 Armin Gerami, Monte Hoover, Pranav S. Dulepet, Ramani Duraiswami

A Multiscale Visualization of Attention in the Transformer Model

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model…

Human-Computer Interaction · Computer Science 2019-06-14 Jesse Vig

Generative Flows with Invertible Attentions

Flow-based generative models have shown an excellent ability to explicitly learn the probability density function of data via a sequence of invertible transformations. Yet, learning attentions in generative flows remains understudied, while…

Machine Learning · Computer Science 2022-04-01 Rhea Sanjay Sukthanker, Zhiwu Huang, Suryansh Kumar, Radu Timofte, Luc Van Gool

LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

Transformers have proven highly effective across modalities, but standard softmax attention scales quadratically with sequence length, limiting long context modeling. Linear attention mitigates this by approximating attention with kernel…

Machine Learning · Computer Science 2026-02-10 Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

A Mechanistic Analysis of Transformers for Dynamical Systems

Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models,…

Machine Learning · Computer Science 2025-12-25 Gregory Duthé, Nikolaos Evangelou, Wei Liu, Ioannis G. Kevrekidis, Eleni Chatzi

GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus

Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Zhaolin Hu, Kun Li, Hehe Fan, Yi Yang

Krause Synchronization Transformers

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that…

Machine Learning · Computer Science 2026-05-26 Jingkun Liu, Yisong Yue, Max Welling, Yue Song

Two Steps Forward and One Behind: Rethinking Time Series Forecasting with Deep Learning

The Transformer is a highly successful deep learning model that has revolutionised the world of artificial neural networks, first in natural language processing and later in computer vision. This model is based on the attention mechanism…

Machine Learning · Computer Science 2023-05-09 Riccardo Ughi, Eugenio Lomurno, Matteo Matteucci