Related papers: Efficient Content-Based Sparse Attention with Rout…

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t.…

Machine Learning · Computer Science 2023-06-05 Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of…

Computation and Language · Computer Science 2019-12-30 Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

Sparse Sinkhorn Attention

We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to…

Machine Learning · Computer Science 2020-02-27 Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention…

Computation and Language · Computer Science 2026-03-31 Dong Liu, Yanxuan Yu

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs:…

Computation and Language · Computer Science 2026-02-06 Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

SparseBERT: Rethinking the Importance Analysis in Self-attention

Transformer-based models are popularly used in natural language processing (NLP). Its core component, self-attention, has aroused widespread interest. To understand the self-attention mechanism, a direct method is to visualize the attention…

Machine Learning · Computer Science 2021-07-02 Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok

Efficient Attention Mechanisms for Large Language Models: A Survey

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into…

Computation and Language · Computer Science 2026-05-26 Spandan Pratyush

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we…

Computer Vision and Pattern Recognition · Computer Science 2022-08-30 Yutong Xie, Jianpeng Zhang, Yong Xia, Anton van den Hengel, Qi Wu

Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage prohibitive for resource-constrained…

Computation and Language · Computer Science 2023-10-24 Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Generating Long Sequences with Sparse Transformers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by…

Computation and Language · Computer Science 2021-09-03 Chuhan Wu, Fangzhao Wu, Tao Qi, Binxing Jiao, Daxin Jiang, Yongfeng Huang, Xing Xie

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

The Transformer architecture model, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be…

Computation and Language · Computer Science 2022-10-03 Chendong Zhao, Jianzong Wang, Wenqi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

Predicting Attention Sparsity in Transformers

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact…

Computation and Language · Computer Science 2022-04-22 Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the…

Machine Learning · Computer Science 2022-06-27 Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms

This paper investigates automatic piano transcription based on computationally-efficient yet high-performant variants of the Transformer that can capture longer-term dependency over the whole musical piece. Recently, transformer-based…

Sound · Computer Science 2025-09-12 Weixing Wei, Kazuyoshi Yoshii

Ripple sparse self-attention for monaural speech enhancement

The use of Transformer represents a recent success in speech enhancement. However, as its core component, self-attention suffers from quadratic complexity, which is computationally prohibited for long speech recordings. Moreover, it allows…

Sound · Computer Science 2023-05-16 Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, Haizhou Li

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma