Using Pilot Systems to Execute Many Task Workloads on Supercomputers

Distributed, Parallel, and Cluster Computing 2018-07-31 v4

Abstract

High-performance computing systems have historically been designed to support applications composed mostly of monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late binding, which helps satisfy the resource requirements of workloads composed of multiple tasks. RADICAL-Pilot (RP) is a modular and extensible Python-based pilot system. In this paper we describe RP's design, architecture, and implementation, and characterize its performance. RP can spawn more than 100 tasks/second and supports the steady-state execution of up to 16K concurrent tasks. RP can be used stand-alone or integrated with other application-level tools as a runtime system.
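The late-binding idea behind a job placeholder can be illustrated with a toy model. The sketch below is not RP's API or implementation; the names `Task`, `Pilot`, and `schedule` are hypothetical and chosen only for illustration. It assumes that a pilot is a single placeholder holding a fixed number of cores, and that tasks are bound to those cores only when enough of them are free.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task: a name and the number of cores it needs."""
    name: str
    cores: int


@dataclass
class Pilot:
    """Hypothetical placeholder job that holds a fixed number of cores."""
    total_cores: int
    free_cores: int = field(init=False, default=0)
    running: list = field(default_factory=list)

    def __post_init__(self):
        # All cores are free when the placeholder becomes active.
        self.free_cores = self.total_cores

    def try_start(self, task):
        """Start the task if enough cores are free; report whether it started."""
        if task.cores <= self.free_cores:
            self.free_cores -= task.cores
            self.running.append(task)
            return True
        return False

    def finish(self, task):
        """Release the cores held by a running task."""
        self.running.remove(task)
        self.free_cores += task.cores


def schedule(pilot, queue):
    """Bind queued tasks to the pilot's free cores (late binding).

    Tasks that do not fit stay in the queue, in their original order,
    and are reconsidered on the next call.
    """
    started = []
    waiting = deque()
    while queue:
        task = queue.popleft()
        if pilot.try_start(task):
            started.append(task.name)
        else:
            waiting.append(task)
    queue.extend(waiting)
    return started
```

In this model the workload (the queue of tasks) is specified independently of the resource (the pilot), and a task is assigned cores only when `schedule` runs. Calling `schedule` again after `finish` releases cores lets waiting tasks start without submitting a new job to the machine's batch system.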

Cite

@article{arxiv.1512.08194,
  title  = {Using Pilot Systems to Execute Many Task Workloads on Supercomputers},
  author = {Andre Merzky and Matteo Turilli and Manuel Maldonado and Mark Santcroos and Shantenu Jha},
  journal= {arXiv preprint arXiv:1512.08194},
  year   = {2018}
}