Saturn: Efficient Multi-Large-Model Deep Learning

Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing. 2023-11-07, v1

Abstract

We propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection or hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting: parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint problem and build a new system architecture that tackles all three simultaneously. Our evaluations show that this joint-optimization approach yields 39-49% lower model selection runtimes than typical current deep learning (DL) practice.
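To make the joint problem concrete, here is a minimal toy sketch. It is not Saturn's actual formulation or solver, which this abstract does not describe. It brute-forces every combination of parallelism technique, GPU count, and job order for two jobs, and uses a simple greedy placement to compute the makespan. Everything in it is an assumption made for illustration: the job names, the technique labels ("fsdp", "pipeline"), the runtime numbers (invented, not measured), and the helper names `makespan` and `joint_search`.

```python
from itertools import permutations, product

# Hypothetical runtime estimates in hours, keyed by (technique, gpu_count).
# These numbers are invented for illustration; they are not measurements.
RUNTIMES = {
    "model_a": {("fsdp", 2): 6.0, ("fsdp", 4): 4.0,
                ("pipeline", 2): 5.0, ("pipeline", 4): 3.5},
    "model_b": {("fsdp", 2): 5.0, ("fsdp", 4): 3.0,
                ("pipeline", 2): 6.0, ("pipeline", 4): 4.0},
}


def usage_at(placed, t):
    """GPUs in use at time t, given placed jobs as (start, end, gpus)."""
    return sum(g for (s, e, g) in placed if s <= t < e)


def fits(placed, start, dur, gpus, total):
    """True if a job needing `gpus` for `dur` hours can start at `start`."""
    end = start + dur
    # Usage only rises at start times, so checking those points suffices.
    points = [start] + [s for (s, _, _) in placed if start < s < end]
    return all(usage_at(placed, p) + gpus <= total for p in points)


def makespan(order, total):
    """Greedily place (dur, gpus) jobs in the given order; return finish time."""
    placed = []
    for dur, gpus in order:
        # Candidate start times: time zero or any earlier job's end time.
        candidates = sorted({0.0} | {e for (_, e, _) in placed})
        start = next(t for t in candidates if fits(placed, t, dur, gpus, total))
        placed.append((start, start + dur, gpus))
    return max(e for (_, e, _) in placed)


def joint_search(runtimes, total_gpus):
    """Exhaustively pick a (technique, gpus) config per job plus a job order."""
    jobs = list(runtimes)
    best = None
    for choice in product(*(runtimes[j].items() for j in jobs)):
        # choice[i] is ((technique, gpus), hours) for jobs[i].
        if any(cfg[1] > total_gpus for cfg, _ in choice):
            continue
        for perm in permutations(range(len(jobs))):
            order = [(choice[i][1], choice[i][0][1]) for i in perm]
            span = makespan(order, total_gpus)
            if best is None or span < best[0]:
                plan = {jobs[i]: choice[i][0] for i in range(len(jobs))}
                best = (span, plan)
    return best


if __name__ == "__main__":
    print(joint_search(RUNTIMES, total_gpus=4))
```

With these made-up numbers and 4 GPUs, the search prefers running both jobs side by side on 2 GPUs each (5.0 hours) over giving each job all 4 GPUs in turn (3.5 + 3.0 = 6.5 hours). This shows why the three decisions interact. Exhaustive search is only feasible for tiny inputs like this one; it stands in here for whatever optimizer a real system would use.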

Cite

@article{arxiv.2311.02840,
  title  = {Saturn: Efficient Multi-Large-Model Deep Learning},
  author = {Kabir Nagrecha and Arun Kumar},
  journal= {arXiv preprint arXiv:2311.02840},
  year   = {2023}
}

Comments

4 pages, 1 figure, 2 tables. Accepted to BayLearn 2023. Abstract of this paper: https://adalabucsd.github.io/papers/TR_2023_Saturn.pdf