Superlinear Multi-Step Attention

Machine Learning 2026-01-27 v1

Abstract

In this paper, we propose \textbf{Superlinear attention}, a fully trainable multi-step attention architecture that achieves subquadratic complexity for long sequences while preserving \textbf{random context access} (a.k.a.\ structural non-exclusion): no eligible token position is structurally excluded from being selected for attention. Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps, yielding an overall complexity of $O(L^{1+\frac{1}{N}})$. To illustrate the architecture, we present a baseline $N=2$ implementation, which is algorithmically analogous to standard jump search. In this $O(L^{3/2})$ instantiation, the first step performs $O(L^{3/2})$ span-search to select relevant spans of the sequence, and the second step applies $O(L^{3/2})$ span-attention (standard attention restricted to the selected spans). In an upscaled $O(L^{1.54})$ configuration for robustness, we achieve an average decoding throughput of 114 tokens/sec at 1M context length and 80 tokens/sec at 10M context in our implementation on a modified 30B hybrid MoE model on a single B200 GPU. With limited training, we also obtain strong performance on the NIAH (Needle In A Haystack) task up to 256K context length, demonstrating that the routed span selection is learnable end-to-end. This paper emphasizes architectural formulation, scaling analysis, and systems feasibility, and presents initial validation; comprehensive quality evaluations across diverse long-context tasks are left to future work.
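For readers unfamiliar with the jump search that the $N=2$ baseline is compared to, the following is a minimal Python sketch of the classical algorithm on a sorted list. It is not the paper's attention implementation: the function name `jump_search` and the block size of roughly $\sqrt{n}$ are illustrative assumptions, and learned span routing is not a sorted lookup. The sketch only shows the two-step probe pattern, a coarse search over block boundaries followed by a fine scan inside one selected block, each costing about $\sqrt{n}$ probes per query. Assuming each of $L$ queries pays a comparable $O(L^{1/2})$ cost per step, this is consistent with the $O(L^{3/2})$ totals stated above.

```python
import math


def jump_search(sorted_values, target):
    """Return an index of target in sorted_values, or -1 if absent.

    Illustrative helper only; block size ~sqrt(n) is an assumption
    chosen to balance the cost of the two steps.
    """
    n = len(sorted_values)
    if n == 0:
        return -1
    step = max(1, math.isqrt(n))

    # Step 1 (coarse): hop across block boundaries, about sqrt(n) probes,
    # until the next boundary value would exceed the target.
    start = 0
    while start + step < n and sorted_values[start + step] <= target:
        start += step

    # Step 2 (fine): linear scan inside the one selected block,
    # again about sqrt(n) probes.
    for i in range(start, min(start + step, n)):
        if sorted_values[i] == target:
            return i
    return -1


if __name__ == "__main__":
    data = list(range(0, 200, 2))  # 100 sorted even numbers
    print(jump_search(data, 42))
    print(jump_search(data, 43))
```

Both steps touch only a sublinear number of positions, yet no position is unreachable: any index can be found by choosing the right block first, which mirrors the structural non-exclusion property described in the abstract.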

Cite

@article{arxiv.2601.18401,
  title  = {Superlinear Multi-Step Attention},
  author = {Yufeng Huang},
  journal= {arXiv preprint arXiv:2601.18401},
  year   = {2026}
}

Comments

30 pages, 6 figures
