Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Xupeng Miao; Yujie Wang; Youhe Jiang; Chunan Shi; Xiaonan Nie; Hailin Zhang; Bin Cui

doi:10.14778/3570690.3570697


Machine Learning; Databases; Distributed, Parallel, and Cluster Computing | 2022-11-28 | v1



Abstract

Transformer models have achieved state-of-the-art performance in various application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging because of the large number of parallelism choices. Existing DL systems either rely on manual effort to design distributed training plans or apply parallelism combinations within a very limited search space. In this paper, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an exceptionally large search space, we 1) use a decision tree to decompose and prune the space based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron always achieves superior system throughput compared to previous work with limited parallelism.
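The dynamic programming search mentioned above can be illustrated with a toy sketch. This is not Galvatron's actual algorithm or cost model: the function name `search_plan`, the strategy labels, and all time and memory numbers are hypothetical, and the sketch assumes each layer independently picks one strategy with a fixed time and memory cost. Under those assumptions, it finds the per-layer choice that minimizes total time within a memory budget.

```python
from typing import Dict, List, Optional, Tuple

# One candidate strategy for a layer: (label, time cost, memory cost).
# Labels and costs are hypothetical placeholders, not measured values.
Strategy = Tuple[str, float, int]


def search_plan(
    layers: List[List[Strategy]], memory_budget: int
) -> Optional[Tuple[float, List[str]]]:
    """Pick one strategy per layer, minimizing total time within the budget.

    Returns (total_time, labels), or None if no plan fits in memory.
    """
    # best[m] = (fastest time, plan) among partial plans using exactly m memory units.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time, plan) in best.items():
            for label, t, mem in candidates:
                total_mem = used + mem
                if total_mem > memory_budget:
                    continue  # prune plans that exceed the memory budget
                cand_time = time + t
                if total_mem not in nxt or cand_time < nxt[total_mem][0]:
                    nxt[total_mem] = (cand_time, plan + [label])
        best = nxt
        if not best:
            return None  # no strategy for this layer fits
    return min(best.values(), key=lambda entry: entry[0])


# Two identical layers, each with three hypothetical strategies.
example_layer: List[Strategy] = [("dp", 1.0, 4), ("tp", 2.0, 2), ("sdp", 3.0, 1)]
print(search_plan([example_layer, example_layer], memory_budget=6))
```

Tightening `memory_budget` in this toy forces the search toward slower but more memory-frugal strategies, which mirrors the trade-off the abstract describes between throughput and GPU memory budgets.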

Keywords

GPU computing; parallel algorithm; large language model training

Cite

@article{arxiv.2211.13878,
  title  = {Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism},
  author = {Xupeng Miao and Yujie Wang and Youhe Jiang and Chunan Shi and Xiaonan Nie and Hailin Zhang and Bin Cui},
  journal= {arXiv preprint arXiv:2211.13878},
  year   = {2022}
}
