Related papers: Superlinear Multi-Step Attention

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens in linear computational complexities, linear attention, in…

Computation and Language · Computer Science 2024-01-17 Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong

Staircase Attention for Recurrent Processing of Sequences

Attention mechanisms have become a standard tool for sequence modeling tasks, in particular by stacking self-attention layers over the entire input sequence as in the Transformer architecture. In this work we introduce a novel attention…

Machine Learning · Computer Science 2021-06-09 Da Ju, Stephen Roller, Sainbayar Sukhbaatar, Jason Weston

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring…

Machine Learning · Computer Science 2024-02-06 Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang

LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu

Efficient Attention Mechanisms for Large Language Models: A Survey

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention…

Computation and Language · Computer Science 2019-11-22 Guangxiang Zhao, Xu Sun, Jingjing Xu, Zhiyuan Zhang, Liangchen Luo

Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and…

Computation and Language · Computer Science 2026-05-01 Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

Pointer: Linear-Complexity Long-Range Modeling without Pre-training

We introduce Pointer, a novel architecture that achieves linear $O(NK)$ complexity for long-range sequence modeling while maintaining superior performance without requiring pre-training. Unlike standard attention mechanisms that compute…

Computation and Language · Computer Science 2025-08-05 Zixi Li

Tailoring Self-Attention for Graph via Rooted Subtrees

Attention mechanisms have made significant strides in graph learning, yet they still exhibit notable limitations: local attention faces challenges in capturing long-range information due to the inherent problems of the message-passing…

Machine Learning · Computer Science 2023-10-10 Siyuan Huang, Yunchong Song, Jiayue Zhou, Zhouhan Lin

Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

Numerical Analysis · Mathematics 2026-04-03 Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

Modeling long sequences is crucial for various large-scale models; however, extending existing architectures to handle longer sequences presents significant technical and resource challenges. In this paper, we propose an efficient and…

Computation and Language · Computer Science 2024-10-08 Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li

Striped Attention: Faster Ring Attention for Causal Transformers

To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming per-device memory bottlenecks by distributing…

Machine Learning · Computer Science 2023-11-17 William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley

Transformer Based Linear Attention with Optimized GPU Kernel Implementation

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami, Ramani Duraiswami

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t.…

Machine Learning · Computer Science 2023-06-05 Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical…

Machine Learning · Computer Science 2021-07-27 Zhenhai Zhu, Radu Soricut

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

We present a novel non attention based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer…

Machine Learning · Computer Science 2025-06-04 Andrew Kiruluta, Preethi Raju, Priscilla Burity

On Fine-Grained I/O Complexity of Attention Backward Passes

Large Language Models (LLMs) exhibit exceptional proficiency in handling extensive context windows in natural language. Nevertheless, the quadratic scaling of attention computation relative to sequence length creates substantial efficiency…

Machine Learning · Computer Science 2026-01-26 Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang