Reformer: The Efficient Transformer

Nikita Kitaev; Łukasz Kaiser; Anselm Levskaya

Machine Learning | 2020-02-19 (v2) | Computation and Language, Machine Learning

Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L\log L)$, where $L$ is the length of the sequence. Second, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
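
The two techniques named in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The bucketing function assumes the common angular LSH scheme (project onto random directions, then take the argmax over the projections and their negations), and the reversible block assumes the usual coupling y1 = x1 + F(x2), y2 = x2 + G(y1). All function names, and the toy `f` and `g` that stand in for the attention and feed-forward sublayers, are hypothetical.

```python
import math
import random


def make_projections(dim, n_buckets, seed=0):
    """Hypothetical helper: n_buckets/2 random Gaussian directions of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_buckets // 2)]


def lsh_bucket(x, projections):
    """Angular LSH sketch: bucket id = argmax over [xR ; -xR].

    Vectors pointing in similar directions tend to share a bucket, so
    attention can be restricted to items in the same bucket instead of
    comparing all L^2 pairs.
    """
    rotated = [sum(xi * ri for xi, ri in zip(x, row)) for row in projections]
    scores = rotated + [-v for v in rotated]
    return max(range(len(scores)), key=scores.__getitem__)


def f(v):
    """Toy stand-in for the attention sublayer (assumption)."""
    return [math.tanh(a) for a in v]


def g(v):
    """Toy stand-in for the feed-forward sublayer (assumption)."""
    return [0.5 * a * a for a in v]


def rev_forward(x1, x2):
    """Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    """Recompute the block's inputs from its outputs.

    Because inputs are recoverable, intermediate activations do not have
    to be stored for every layer during training.
    """
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


if __name__ == "__main__":
    proj = make_projections(dim=4, n_buckets=8, seed=1)
    x = [0.3, -1.2, 0.7, 2.0]
    print("bucket:", lsh_bucket(x, proj))

    y1, y2 = rev_forward([1.0, -2.0], [0.5, 0.25])
    print("recovered inputs:", rev_inverse(y1, y2))
```

In this sketch the bucket id depends only on the direction of the vector, not its length, and `rev_inverse` recovers the inputs up to floating-point rounding.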

Keywords

transformer, attention mechanism, model transformation

Cite

@article{arxiv.2001.04451,
  title  = {Reformer: The Efficient Transformer},
  author = {Nikita Kitaev and Łukasz Kaiser and Anselm Levskaya},
  journal= {arXiv preprint arXiv:2001.04451},
  year   = {2020}
}

Comments

ICLR 2020
