Related papers: Superlinear Multi-Step Attention

200 papers

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens with linear computational complexity, linear attention, in…

Computation and Language · Computer Science 2024-01-17 Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong

Attention mechanisms have become a standard tool for sequence modeling tasks, in particular by stacking self-attention layers over the entire input sequence as in the Transformer architecture. In this work we introduce a novel attention…

Machine Learning · Computer Science 2021-06-09 Da Ju, Stephen Roller, Sainbayar Sukhbaatar, Jason Weston

Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring…

Machine Learning · Computer Science 2024-02-06 Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang

Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity ($O(N^2)$) creates a major computational bottleneck. Linear Attention offers an $O(N)$ solution, but its…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention…

Computation and Language · Computer Science 2019-11-22 Guangxiang Zhao, Xu Sun, Jingjing Xu, Zhiyuan Zhang, Liangchen Luo

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and…

Computation and Language · Computer Science 2026-05-01 Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

We introduce Pointer, a novel architecture that achieves linear $O(NK)$ complexity for long-range sequence modeling while maintaining superior performance without requiring pre-training. Unlike standard attention mechanisms that compute…

Computation and Language · Computer Science 2025-08-05 Zixi Li

Attention mechanisms have made significant strides in graph learning, yet they still exhibit notable limitations: local attention faces challenges in capturing long-range information due to the inherent problems of the message-passing…

Machine Learning · Computer Science 2023-10-10 Siyuan Huang, Yunchong Song, Jiayue Zhou, Zhouhan Lin

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

Modeling long sequences is crucial for various large-scale models; however, extending existing architectures to handle longer sequences presents significant technical and resource challenges. In this paper, we propose an efficient and…

Computation and Language · Computer Science 2024-10-08 Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li

To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming per-device memory bottlenecks by distributing…

Machine Learning · Computer Science 2023-11-17 William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami, Ramani Duraiswami

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t.…

Machine Learning · Computer Science 2023-06-05 Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical…

Machine Learning · Computer Science 2021-07-27 Zhenhai Zhu, Radu Soricut

We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer…

Machine Learning · Computer Science 2025-06-04 Andrew Kiruluta, Preethi Raju, Priscilla Burity

Large Language Models (LLMs) exhibit exceptional proficiency in handling extensive context windows in natural language. Nevertheless, the quadratic scaling of attention computation relative to sequence length creates substantial efficiency…

Machine Learning · Computer Science 2026-01-26 Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang