Composing Distributed Computations Through Task and Kernel Fusion

Distributed, Parallel, and Cluster Computing 2024-12-17 v2

Abstract

We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that makes the analyses required to fuse distributed tasks scalable. We pair task fusion with a JIT compiler that fuses the kernels within each fused task. We show empirically that Diffuse's intermediate representation is general enough to serve as a target for two real-world, task-based libraries (cuNumeric and Legate Sparse), which lets Diffuse find optimization opportunities across function and library boundaries. On up to 128 GPUs, Diffuse accelerates unmodified applications built by composing task-based libraries by a geometric mean of 1.86x, with individual results ranging from 0.93x to 10.7x. Diffuse also finds optimization opportunities missed by the original application developers, enabling high-level Python programs to match or exceed the performance of an explicitly parallel MPI library.
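
To make the idea of fusion concrete, the following is a minimal single-process sketch in plain Python. It is not Diffuse's intermediate representation or API: the `Task` class and the `run_tasks` and `fuse` helpers are hypothetical names introduced only for illustration, and the sketch assumes every task is a simple elementwise operation over equally sized arrays. It shows how two tasks computing `z = a * x + y` can be replaced by one fused task that makes a single pass over the data and never materializes the temporary `t`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    """Hypothetical elementwise task: output[i] = fn(inputs[0][i], inputs[1][i], ...)."""
    name: str
    inputs: Tuple[str, ...]
    output: str
    fn: Callable[..., float]


def run_tasks(tasks: List[Task], store: Dict[str, list]) -> Dict[str, list]:
    """Run each task as its own pass over the data, materializing every output array."""
    store = dict(store)  # copy so the caller's store is left untouched
    for t in tasks:
        columns = [store[k] for k in t.inputs]
        store[t.output] = [t.fn(*vals) for vals in zip(*columns)]
    return store


def fuse(tasks: List[Task]) -> Task:
    """Fuse a chain of elementwise tasks into a single task (illustrative only)."""
    external: List[str] = []  # arrays read from outside the fused group
    produced = set()          # arrays written by an earlier task in the group
    for t in tasks:
        for k in t.inputs:
            if k not in produced and k not in external:
                external.append(k)
        produced.add(t.output)

    def fused_fn(*vals):
        # Intermediates live in a per-element environment, not in full arrays.
        env = dict(zip(external, vals))
        for t in tasks:
            env[t.output] = t.fn(*(env[k] for k in t.inputs))
        return env[tasks[-1].output]

    fused_name = "+".join(t.name for t in tasks)
    return Task(fused_name, tuple(external), tasks[-1].output, fused_fn)


# Example: z = a * x + y expressed as two tasks with a temporary t.
tasks = [
    Task("mul", ("a", "x"), "t", lambda a, x: a * x),
    Task("add", ("t", "y"), "z", lambda t, y: t + y),
]
store = {"a": [2, 2, 2], "x": [1, 2, 3], "y": [10, 20, 30]}

unfused = run_tasks(tasks, store)
fused_task = fuse(tasks)
fused = run_tasks([fused_task], store)
```

Both runs produce the same `z`, but the fused run reads only `a`, `x`, and `y` and writes only `z`. A distributed system must additionally reason about how data is partitioned and communicated between tasks before fusing them, which this toy example deliberately ignores.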

Cite

@article{arxiv.2406.18109,
  title  = {Composing Distributed Computations Through Task and Kernel Fusion},
  author = {Rohan Yadav and Shiv Sundram and Wonchan Lee and Michael Garland and Michael Bauer and Alex Aiken and Fredrik Kjolstad},
  journal= {arXiv preprint arXiv:2406.18109},
  year   = {2024}
}