Long-Context Generalization with Sparse Attention

Pavlo Vasylenko; Hugo Pitorro; André F. T. Martins; Marcos Treviso

Computation and Language; Artificial Intelligence. 2026-03-03 (v4)

Abstract

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
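The contrast between dense and sparse attention described above can be illustrated with a small sketch. This is not the authors' implementation and it does not include the learnable temperature of ASEntmax; it only compares softmax with a generic $\alpha$-entmax solved by bisection on the threshold. The function names, the choice $\alpha = 1.5$, and the toy logits (one informative token at 4.0, all distractors at 0.0) are assumptions made for illustration.

```python
import math

def softmax(z):
    """Standard softmax: every entry receives strictly positive mass."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau.

    Solves p_i = max((alpha - 1) * z_i - tau, 0) ** (1 / (alpha - 1))
    subject to sum(p) == 1. Illustrative helper, not a library API.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1")
    zs = [(alpha - 1.0) * v for v in z]
    exponent = 1.0 / (alpha - 1.0)
    hi = max(zs)      # at tau = hi every entry is clipped to zero (sum = 0)
    lo = hi - 1.0     # at tau = lo the largest entry alone already gives sum >= 1
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(v - tau, 0.0) ** exponent for v in zs)
        if total >= 1.0:
            lo = tau  # threshold too low: too much mass
        else:
            hi = tau  # threshold too high: too little mass
    p = [max(v - lo, 0.0) ** exponent for v in zs]
    s = sum(p)
    return [v / s for v in p]

# Toy setting (assumed values): one informative token, n - 1 distractors.
for n in (10, 100, 1000):
    logits = [4.0] + [0.0] * (n - 1)
    p_soft = softmax(logits)
    p_ent = entmax_bisect(logits, alpha=1.5)
    nonzero = sum(1 for v in p_ent if v > 0.0)
    print(f"n={n:5d}  softmax p[0]={p_soft[0]:.3f}  "
          f"entmax p[0]={p_ent[0]:.3f}  entmax nonzeros={nonzero}")
```

In this toy setting the softmax weight on the informative token shrinks as distractors are added, while the entmax output keeps all of its mass on that token because the distractors fall below the threshold and receive exactly zero. With smaller logit gaps the entmax output would be less sparse, which is the regime a temperature parameter is meant to control.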

Keywords

attention mechanism; sparse learning; sequence alignment

Cite

@article{arxiv.2506.16640,
  title  = {Long-Context Generalization with Sparse Attention},
  author = {Pavlo Vasylenko and Hugo Pitorro and André F. T. Martins and Marcos Treviso},
  journal= {arXiv preprint arXiv:2506.16640},
  year   = {2026}
}

Comments

ICLR 2026
