Related papers: Krause Synchronization Transformers

Flowformer: Linearizing Transformers with Conservation Flows

Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling…

Machine Learning · Computer Science 2022-06-17 Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long

Selective Attention: Enhancing Transformer through Principled Context Control

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same…

Machine Learning · Computer Science 2024-11-21 Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

Efficient Content-Based Sparse Attention with Routing Transformers

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches…

Machine Learning · Computer Science 2020-10-27 Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier

Linear Log-Normal Attention with Unbiased Concentration

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan, Joseph Kampeas, Emir Haleva

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of…

Computation and Language · Computer Science 2019-12-30 Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

Focal Self-attention for Local-Global Interactions in Vision Transformers

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-02 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

Relaxed Attention for Transformer Models

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and - for natural language processing tasks - lead to an implicitly learned internal language model in the autoregressive…

Machine Learning · Computer Science 2022-09-21 Timo Lohrenz, Björn Möller, Zhengyang Li, Tim Fingscheidt

Stabilizing Transformer Training Through Consensus

Standard attention-based transformers are known to exhibit instability under learning rate overspecification during training, particularly at high learning rates. While various methods have been proposed to improve resilience to such…

Machine Learning · Computer Science 2026-02-02 Shyam Venkatasubramanian, Sean Moushegian, Michael Lin, Mir Park, Ankit Singhal, Connor Lee

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the…

Machine Learning · Computer Science 2022-06-27 Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows…

Machine Learning · Computer Science 2025-07-01 Venmugil Elango

Selective Synchronization Attention

The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective…

Machine Learning · Computer Science 2026-02-17 Hasi Hays

Transformers Learn Faster with Semantic Focus

Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of…

Machine Learning · Computer Science 2025-06-19 Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

Rethinking Query-Key Pairwise Interactions in Vision Transformers

Vision Transformers have achieved state-of-the-art performance in many visual tasks. Due to the quadratic computational and memory complexities of self-attention, recent works either apply attention only to low-resolution inputs or restrict…

Computer Vision and Pattern Recognition · Computer Science 2022-07-05 Cheng Li, Yangxin Liu

AttentionDrop: A Novel Regularization Method for Transformer Models

Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Abdul Akbar Khan, Shahid Munir Shah, Muhammad Omer Khan

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science 2024-06-21 Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

The Transformer architecture model, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be…

Computation and Language · Computer Science 2022-10-03 Chendong Zhao, Jianzong Wang, Wen qi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

Sinkformers: Transformers with Doubly Stochastic Attention

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise…

Machine Learning · Computer Science 2022-01-25 Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching…

Hardware Architecture · Computer Science 2025-01-15 Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan

Does Self-Attention Need Separate Weights in Transformers?

The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent…

Computation and Language · Computer Science 2025-05-05 Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi