Block-State Transformers

Mahan Fathi; Jonathan Pilault; Orhan Firat; Christopher Pal; Pierre-Luc Bacon; Ross Goroshin

Subjects: Computation and Language; Machine Learning. Version v4, 2023-10-31.

Abstract

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
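The hybrid layer described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation, and it does not correspond to any of the three variants studied in the paper. It assumes a toy diagonal linear SSM with fixed scalar coefficients `a` and `b`, a single attention head with no learned projections, and one context state per block. The names `ssm_scan` and `block_state_layer` are hypothetical. The sequential loop is for readability only; a practical SSM sublayer would be computed in parallel.

```python
import numpy as np


def ssm_scan(x, a, b):
    """Toy diagonal linear SSM (assumption): h[t] = a * h[t-1] + b * x[t]."""
    h = np.zeros_like(x)
    state = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        state = a * state + b * x[t]
        h[t] = state
    return h


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def block_state_layer(x, block_len, a=0.9, b=0.1):
    """Block-wise causal attention with one SSM context state per block.

    x has shape (seq_len, d). Each block attends to its own tokens
    (short-term) plus the SSM state at the end of the previous block
    (long-range context). Simplified illustration only.
    """
    seq_len, d = x.shape
    assert seq_len % block_len == 0, "seq_len must be a multiple of block_len"

    context = ssm_scan(x, a, b)  # (seq_len, d), summarises the past at every step
    out = np.zeros_like(x)

    # Causal mask inside a block, with an always-visible context column in front.
    causal = np.tril(np.ones((block_len, block_len), dtype=bool))
    mask = np.concatenate([np.ones((block_len, 1), dtype=bool), causal], axis=1)

    for start in range(0, seq_len, block_len):
        block = x[start:start + block_len]
        # SSM state just before this block; zeros for the first block.
        ctx = context[start - 1:start] if start > 0 else np.zeros((1, d))
        keys = np.concatenate([ctx, block], axis=0)  # (1 + block_len, d)
        scores = block @ keys.T / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)
        out[start:start + block_len] = softmax(scores) @ keys
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(12, 8))
    print(block_state_layer(tokens, block_len=4).shape)
```

In this sketch the attention cost grows with `seq_len * block_len` rather than `seq_len ** 2`, while information from earlier blocks reaches later ones only through the SSM state.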

Keywords

speech translation; transformer; long short-term memory

Cite

@article{arxiv.2306.09539,
  title  = {Block-State Transformers},
  author = {Mahan Fathi and Jonathan Pilault and Orhan Firat and Christopher Pal and Pierre-Luc Bacon and Ross Goroshin},
  journal= {arXiv preprint arXiv:2306.09539},
  year   = {2023}
}

Comments

NeurIPS'23 - Thirty-seventh Conference on Neural Information Processing Systems
