Optimizing Deeper Transformers on Small Datasets

Peng Xu; Dhruv Kumar; Wei Yang; Wenjie Zi; Keyi Tang; Chenyang Huang; Jackie Chi Kit Cheung; Simon J. D. Prince; Yanshuai Cao

Computation and Language; Machine Learning. 2021-06-01 (v4)

Abstract

It is commonly believed that training deep transformers from scratch requires large datasets. Consequently, for small datasets, practitioners usually add only shallow and simple layers on top of pre-trained models during fine-tuning. This work shows that this need not be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train $48$ transformer layers, comprising $24$ fine-tuned layers from pre-trained RoBERTa and $24$ relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can improve generalization on small datasets for hard cases that require reasoning and structural understanding.

Keywords

transformer; deep learning; parameter-efficient fine-tuning

Cite

@article{arxiv.2012.15355,
  title  = {Optimizing Deeper Transformers on Small Datasets},
  author = {Peng Xu and Dhruv Kumar and Wei Yang and Wenjie Zi and Keyi Tang and Chenyang Huang and Jackie Chi Kit Cheung and Simon J. D. Prince and Yanshuai Cao},
  journal= {arXiv preprint arXiv:2012.15355},
  year   = {2021}
}

Comments

Accepted at ACL 2021 main conference
