Cached Transformers: Improving Transformers with Differentiable Memory Cache

Zhaoyang Zhang; Wenqi Shao; Yixiao Ge; Xiaogang Wang; Jinwei Gu; Ping Luo

Computer Vision and Pattern Recognition 2023-12-21 v1

Abstract

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and displays the ability to be applied to a broader range of situations.
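
The mechanism the abstract describes (a token cache refreshed by a recurrent gate, with attention over both the cache and the current tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the gate, the cache summary, or how the two attention outputs are combined, so the interpolation gate, the average-pooled summary, the fixed `mix` weight, and every function and parameter name below are assumptions made for illustration. The sketch covers only the forward pass, uses a single head without learned query/key/value projections, and assumes at least as many tokens as cache slots.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attend(queries, memory):
    # Single-head scaled dot-product attention; `memory` serves as keys and values.
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)
    return softmax(scores) @ memory


def update_cache(cache, tokens, w_gate):
    # Assumed gated recurrent update: blend the old cache with a summary of the new tokens.
    m = cache.shape[0]
    # Hypothetical summary: average-pool the tokens into m cache slots (needs len(tokens) >= m).
    summary = np.stack([chunk.mean(axis=0) for chunk in np.array_split(tokens, m)])
    gate = sigmoid(np.concatenate([cache, summary], axis=-1) @ w_gate)  # shape (m, d)
    return (1.0 - gate) * cache + gate * summary


def cached_attention(tokens, cache, mix=0.5):
    # Assumed combination: fixed-weight mix of self-attention and attention over the cache.
    return (1.0 - mix) * attend(tokens, tokens) + mix * attend(tokens, cache)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 8, 4, 16  # tokens per step, cache slots, feature size
    cache = np.zeros((m, d))
    w_gate = rng.normal(scale=0.1, size=(2 * d, d))
    for _ in range(3):
        tokens = rng.normal(size=(n, d))
        out = cached_attention(tokens, cache)        # attends to current and cached tokens
        cache = update_cache(cache, tokens, w_gate)  # cache carries information to the next step
    print(out.shape, cache.shape)
```

Because the cache has a fixed number of slots, the cost of attending to it stays constant per step while its contents keep absorbing earlier inputs, which is one way to read the abstract's claim of a larger receptive field.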

Keywords

attention mechanism, graph signal processing, transformer

Cite

@article{arxiv.2312.12742,
  title  = {Cached Transformers: Improving Transformers with Differentiable Memory Cache},
  author = {Zhaoyang Zhang and Wenqi Shao and Yixiao Ge and Xiaogang Wang and Jinwei Gu and Ping Luo},
  journal= {arXiv preprint arXiv:2312.12742},
  year   = {2023}
}

Comments

AAAI 2024
