Transformers, parallel computation, and logarithmic depth

Clayton Sanford; Daniel Hsu; Matus Telgarsky

Machine Learning 2024-02-15 v1

Abstract

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
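
As a rough illustration of why a logarithmic number of parallel rounds can be enough for tasks that look inherently sequential, the sketch below uses pointer doubling, a standard parallel-computation technique. It is not taken from the paper and is not a transformer or an MPC protocol; the function names `compose_k_sequential` and `compose_k_doubling` are hypothetical. Each list comprehension stands in for one round in which every position is updated independently.

```python
def compose_k_sequential(nxt, k):
    """Follow the pointer array `nxt` k times, one hop per round (k rounds)."""
    cur = list(range(len(nxt)))
    for _ in range(k):
        # One round: every position advances by a single hop.
        cur = [nxt[c] for c in cur]
    return cur, k


def compose_k_doubling(nxt, k):
    """Compute the same k-fold composition in about log2(k) rounds."""
    result = list(range(len(nxt)))  # identity map, i.e. zero hops
    power = list(nxt)               # map for 2^j hops, starting at j = 0
    rounds = 0
    while k > 0:
        if k & 1:
            # Append the current power-of-two jump to the accumulated map.
            result = [power[r] for r in result]
        # Square the jump: 2^j hops becomes 2^(j+1) hops.
        power = [power[p] for p in power]
        k >>= 1
        rounds += 1
    return result, rounds
```

Both functions return the same map, but the doubling version uses a number of rounds equal to the bit length of k rather than k itself, which is the kind of gap between sequential and parallel depth that the abstract refers to.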

Keywords

transformer, parallel algorithm, parallel programming

Cite

@article{arxiv.2402.09268,
  title  = {Transformers, parallel computation, and logarithmic depth},
  author = {Clayton Sanford and Daniel Hsu and Matus Telgarsky},
  journal= {arXiv preprint arXiv:2402.09268},
  year   = {2024}
}

Comments

58 pages, 19 figures, code available at https://github.com/chsanford/hop-induction-heads

Related papers

Computational Complexity · Computer Science

The Parallelism Tradeoff: Limitations of Log-Precision Transformers