Related papers: Selective Synchronization Attention

Orthogonal Self-Attention

Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA…

Machine Learning · Computer Science 2026-02-06 Leo Zhang , James Martens

Shunted Self-Attention via Multi-Scale Token Aggregation

Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models,…

Computer Vision and Pattern Recognition · Computer Science 2022-04-14 Sucheng Ren , Daquan Zhou , Shengfeng He , Jiashi Feng , Xinchao Wang

Selective Attention: Enhancing Transformer through Principled Context Control

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same…

Machine Learning · Computer Science 2024-11-21 Xuechen Zhang , Xiangyu Chang , Mingchen Li , Amit Roy-Chowdhury , Jiasi Chen , Samet Oymak

Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms

By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNNs design. It introduces a Spiking Self-Attention (SSA) module to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Qingyu Wang , Duzhen Zhang , Tielin Zhang , Bo Xu

Combining Aggregated Attention and Transformer Architecture for Accurate and Efficient Performance of Spiking Neural Networks

Spiking Neural Networks have attracted significant attention in recent years due to their distinctive low-power characteristics. Meanwhile, Transformer models, known for their powerful self-attention mechanisms and parallel processing…

Neural and Evolutionary Computing · Computer Science 2024-12-19 Hangming Zhang , Alexander Sboev , Roman Rybka , Qiang Yu

SAMSA: Efficient Transformer for Many Data Modalities

The versatility of the self-attention mechanism has earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. Efficient transformers, on the other hand, often rely on…

Machine Learning · Computer Science 2024-08-20 Minh Lenhat , Viet Anh Nguyen , Khoa Nguyen , Duong Duc Hieu , Dao Huu Hung , Truong Son Hy

Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant…

Machine Learning · Computer Science 2024-12-24 Ziyang Wu , Tianjiao Ding , Yifu Lu , Druv Pai , Jingyuan Zhang , Weida Wang , Yaodong Yu , Yi Ma , Benjamin D. Haeffele

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference…

Computation and Language · Computer Science 2026-02-02 Zhenyi Shen , Junru Lu , Lin Gui , Jiazheng Li , Yulan He , Di Yin , Xing Sun

SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Yuan Cao , Dong Wang

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

Attention mechanisms, which enable a neural network to accurately focus on all the relevant elements of the input, have become an essential component to improve the performance of deep neural networks. There are mainly two attention…

Computer Vision and Pattern Recognition · Computer Science 2021-02-02 Qing-Long Zhang , Yu-Bin Yang

Sparse Sinkhorn Attention

We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to…

Machine Learning · Computer Science 2020-02-27 Yi Tay , Dara Bahri , Liu Yang , Donald Metzler , Da-Cheng Juan

Spikformer: When Spiking Neural Network Meets Transformer

We consider two biologically plausible structures, the Spiking Neural Network (SNN) and the self-attention mechanism. The former offers an energy-efficient and event-driven paradigm for deep learning, while the latter has the ability to…

Neural and Evolutionary Computing · Computer Science 2022-11-23 Zhaokun Zhou , Yuesheng Zhu , Chao He , Yaowei Wang , Shuicheng Yan , Yonghong Tian , Li Yuan

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science 2024-06-21 Martin Courtois , Malte Ostendorff , Leonhard Hennig , Georg Rehm

Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with…

Machine Learning · Computer Science 2025-10-03 Adam Filipek

The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles

Transformers use the dense self-attention mechanism which gives a lot of flexibility for long-range connectivity. Over multiple layers of a deep transformer, the number of possible connectivity patterns increases exponentially. However,…

Machine Learning · Computer Science 2023-06-05 Md Shamim Hussain , Mohammed J. Zaki , Dharmashankar Subramanian

Krause Synchronization Transformers

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that…

Machine Learning · Computer Science 2026-05-26 Jingkun Liu , Yisong Yue , Max Welling , Yue Song

Sessa: Selective State Space Attention

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent…

Machine Learning · Computer Science 2026-04-22 Liubomyr Horbatko

SATA: Sparsity-Aware Scheduling for Selective Token Attention

Transformers have become the foundation of numerous state-of-the-art AI models across diverse domains, thanks to their powerful attention mechanism for modeling long-range dependencies. However, the quadratic scaling complexity of attention…

Hardware Architecture · Computer Science 2026-01-29 Zhenkun Fan , Zishen Wan , Che-Kai Liu , Ashwin Sanjay Lele , Win-San Khwa , Bo Zhang , Meng-Fan Chang , Arijit Raychowdhury

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant…

Computation and Language · Computer Science 2024-11-01 Yu Zhang , Songlin Yang , Ruijie Zhu , Yue Zhang , Leyang Cui , Yiqiao Wang , Bolun Wang , Freda Shi , Bailin Wang , Wei Bi , Peng Zhou , Guohong Fu

Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's…

Computation and Language · Computer Science 2026-05-06 Zehao Jin , Yanan Sui