Pay Attention when Required

Machine Learning 2021-05-18 v3 Computation and Language

Abstract

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs in the choice and ordering of these blocks to improve on the current Transformer architecture, and proposed the PAR Transformer. By replacing ~63% of the self-attention blocks with feed-forward blocks, it needs 35% less compute time than Transformer-XL while retaining perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
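
To make the "~63% replaced" figure concrete, the following minimal sketch counts blocks in a toy layer pattern. The 16-layer baseline, the choice to keep 6 self-attention blocks, and the rule for which blocks are kept are all illustrative assumptions, not details taken from the abstract; the helper names are hypothetical.

```python
def interleaved(n_layers: int) -> str:
    """Baseline ordering: self-attention ('s') then feed-forward ('f') per layer."""
    return "sf" * n_layers


def replace_attention(pattern: str, keep: int) -> str:
    """Keep the first `keep` self-attention blocks; turn the rest into feed-forward.

    Which blocks are kept is an arbitrary choice made for illustration only.
    """
    out, seen = [], 0
    for block in pattern:
        if block == "s":
            seen += 1
            out.append("s" if seen <= keep else "f")
        else:
            out.append(block)
    return "".join(out)


def replaced_fraction(baseline: str, variant: str) -> float:
    """Fraction of the baseline's self-attention blocks that were replaced."""
    return 1 - variant.count("s") / baseline.count("s")


baseline = interleaved(16)                     # assumed 16-layer baseline
variant = replace_attention(baseline, keep=6)  # assumed: 6 attention blocks kept
fraction = replaced_fraction(baseline, variant)
print(f"{fraction:.1%}")  # prints 62.5%
```

Under these assumptions the total block count is unchanged (32), and 10 of the 16 self-attention blocks become feed-forward blocks, which rounds to the ~63% quoted above.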

Cite

@article{arxiv.2009.04534,
  title  = {Pay Attention when Required},
  author = {Swetha Mandava and Szymon Migacz and Alex Fit-Florea},
  journal= {arXiv preprint arXiv:2009.04534},
  year   = {2021}
}

Comments

9 pages, 5 figures, 7 tables
