Long-Context Generalization with Sparse Attention

Computation and Language 2026-03-03 v4 Artificial Intelligence

Abstract

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using α-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows α-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000× length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8× training length.
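The dispersion effect and the role of exact zeros can be illustrated with a minimal NumPy sketch. It is not the paper's ASEntmax: it uses sparsemax, the α = 2 member of the α-entmax family, and a fixed scalar temperature where the paper's is learnable. The function names, the score gap of 2.0, and the toy setup (one informative token among n − 1 uninformative ones) are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    """Dense attention weights: every token receives nonzero mass."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection onto the simplex (alpha-entmax with alpha = 2)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    # Largest k whose top-k scores all stay above the threshold tau.
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

def attention_on_target(n, gap=2.0, temperature=1.0):
    """Weight on one informative token among n - 1 uninformative ones.

    Hypothetical toy setup: the informative token scores `gap`, all
    others score 0. `temperature` is a fixed scalar in this sketch.
    """
    scores = np.zeros(n)
    scores[0] = gap
    scaled = scores / temperature
    return softmax(scaled)[0], sparsemax(scaled)[0]

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        soft, sparse = attention_on_target(n)
        print(f"n={n:>7}  softmax={soft:.6f}  sparsemax={sparse:.6f}")
```

In this toy setting the softmax weight on the informative token shrinks as n grows, while sparsemax keeps all mass on it and assigns exact zeros elsewhere. Raising the temperature (for example `temperature=4.0` with a short sequence) shrinks the score gap and makes the sparsemax output dense, which is the sparse-to-dense interpolation the abstract attributes to a temperature parameter.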

Cite

@article{arxiv.2506.16640,
  title  = {Long-Context Generalization with Sparse Attention},
  author = {Pavlo Vasylenko and Hugo Pitorro and André F. T. Martins and Marcos Treviso},
  journal= {arXiv preprint arXiv:2506.16640},
  year   = {2026}
}

Comments

ICLR 2026