Foundation Transformers

Hongyu Wang; Shuming Ma; Shaohan Huang; Li Dong; Wenhui Wang; Zhiliang Peng; Yu Wu; Payal Bajaj; Saksham Singhal; Alon Benhaim; Barun Patra; Zhun Liu; Vishrav Chaudhary; Xia Song; Furu Wei

Machine Learning; Computation and Language; Computer Vision and Pattern Recognition | 2022-10-20 (v2)

Abstract

Model architectures across language, vision, speech, and multimodal tasks are converging. However, under the shared name "Transformers", these areas use different implementations to get better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of a Foundation Transformer for true general-purpose modeling, one that serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill this goal. Specifically, we propose Sub-LayerNorm for good expressivity, and an initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate better performance and stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
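The difference between the normalization placements named in the abstract can be illustrated with a minimal sketch of a single feed-forward sublayer. This is not the authors' implementation: the function names are hypothetical, LayerNorm is shown without learned gain or bias, attention is omitted, and the Sub-LayerNorm placement (one LayerNorm before the input projection and one before the output projection) is an assumption, since the abstract does not spell it out. The DeepNet-derived initialization is not shown.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance; no learned gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def ffn_post_ln(x, w1, w2):
    # Post-LayerNorm: normalize after the residual addition.
    return layer_norm(add(x, matvec(w2, relu(matvec(w1, x)))))

def ffn_pre_ln(x, w1, w2):
    # Pre-LayerNorm: normalize only the sublayer input; the residual path is untouched.
    return add(x, matvec(w2, relu(matvec(w1, layer_norm(x)))))

def ffn_sub_ln(x, w1, w2):
    # Sub-LayerNorm (assumed placement): LayerNorm before the input projection
    # and a second LayerNorm before the output projection.
    hidden = relu(matvec(w1, layer_norm(x)))
    return add(x, matvec(w2, layer_norm(hidden)))

# Toy example: model width 3, hidden width 2 (weights chosen arbitrarily).
x = [1.0, 2.0, 4.0]
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
post_out = ffn_post_ln(x, w1, w2)
pre_out = ffn_pre_ln(x, w1, w2)
sub_out = ffn_sub_ln(x, w1, w2)
```

In this sketch only the Post-LayerNorm output is itself normalized; the Pre-LayerNorm and Sub-LayerNorm variants leave the residual stream unnormalized and differ in whether the hidden activations are normalized before the output projection.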

Keywords

transformer; pre-trained language model; model transformation

Cite

@article{arxiv.2210.06423,
  title  = {Foundation Transformers},
  author = {Hongyu Wang and Shuming Ma and Shaohan Huang and Li Dong and Wenhui Wang and Zhiliang Peng and Yu Wu and Payal Bajaj and Saksham Singhal and Alon Benhaim and Barun Patra and Zhun Liu and Vishrav Chaudhary and Xia Song and Furu Wei},
  journal= {arXiv preprint arXiv:2210.06423},
  year   = {2022}
}

Comments

Work in progress
