Related papers: Long-Context Generalization with Sparse Attention

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Sparse and Continuous Attention Mechanisms

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions…

Machine Learning · Computer Science 2020-10-30 André F. T. Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro M. Q. Aguiar, Mário A. T. Figueiredo

Sparse Sequence-to-Sequence Models

Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This…

Computation and Language · Computer Science 2019-06-14 Ben Peters, Vlad Niculae, André F. T. Martins

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

AdaSplash-2: Faster Differentiable Sparse Attention

Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is $\alpha$-entmax attention, a differentiable sparse alternative to…

Machine Learning · Computer Science 2026-04-17 Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso

Scalable-Softmax Is Superior for Attention

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to…

Computation and Language · Computer Science 2025-02-03 Ken M. Nakanishi

Selective Attention: Enhancing Transformer through Principled Context Control

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same…

Machine Learning · Computer Science 2024-11-21 Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

Transformers Learn Faster with Semantic Focus

Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of…

Machine Learning · Computer Science 2025-06-19 Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

AdaSplash: Adaptive Sparse Flash Attention

The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but…

Computation and Language · Computer Science 2025-06-10 Nuno Gonçalves, Marcos Treviso, André F. T. Martins

Predicting Attention Sparsity in Transformers

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact…

Computation and Language · Computer Science 2022-04-22 Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

EntmaxKV: Support-Aware Decoding for Entmax Attention

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets…

Machine Learning · Computer Science 2026-05-22 Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of…

Computation and Language · Computer Science 2025-05-13 Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into…

Computation and Language · Computer Science 2026-05-26 Spandan Pratyush

Normalized Attention Without Probability Cage

Attention architectures are widely used; they recently gained renewed popularity with Transformers yielding a streak of state of the art results. Yet, the geometrical implications of softmax-attention remain largely unexplored. In this work…

Machine Learning · Computer Science 2020-05-20 Oliver Richter, Roger Wattenhofer

Speeding Up Entmax

Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being…

Computation and Language · Computer Science 2022-05-20 Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Large language models (LLMs) are known for their exceptional performance in natural language processing, making them highly effective in many human life-related or even job-related tasks. The attention mechanism in the Transformer…

Computation and Language · Computer Science 2023-04-27 Shuai Li, Zhao Song, Yu Xia, Tong Yu, Tianyi Zhou

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity…

Computation and Language · Computer Science 2025-09-04 Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

Attention Needs to Focus: A Unified Perspective on Attention Allocation

The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention…

Machine Learning · Computer Science 2026-01-08 Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao