PaPy: Parallel and Distributed Data-processing Pipelines in Python
Abstract
PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written Python functions (nodes) connected by 'pipes' (edges) into a directed acyclic graph. These functions are arbitrarily definable, and can make use of any Python modules or external binaries. Given a user-defined topology and a collection of input data, functions are composed into nested higher-order maps, which are transparently and robustly evaluated in parallel on a single computer or on remote hosts. Local and remote computational resources can be flexibly pooled and assigned to functional nodes, thereby allowing facile load-balancing and pipeline optimization to maximize computational throughput. Input items are processed by nodes in parallel, and traverse the graph in batches of adjustable size; the batch size sets a trade-off between lazy evaluation, parallelism, and memory consumption. The processing of a single item can be parallelized in a scatter/gather scheme. The simplicity and flexibility of distributed workflows using PaPy bridge the gap between desktop and grid computing, enabling this new computing paradigm to be leveraged in the processing of large scientific datasets.
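The idea of composing node functions into nested, lazily evaluated maps that process input in adjustable batches can be sketched as follows. This is a minimal illustration using only the Python standard library; it does not use PaPy's own API. The names `batched`, `parallel_map`, `clean`, `count_words`, and `run_pipeline` are hypothetical and chosen for this example, and a thread pool stands in for PaPy's pooled local or remote workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items, consuming input lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def parallel_map(func, items, pool, batch_size):
    """Lazily map `func` over `items`, evaluating one batch at a time in `pool`."""
    for batch in batched(items, batch_size):
        # Only one batch per node is submitted at a time, which bounds memory use.
        yield from pool.map(func, batch)


# Two hypothetical node functions forming a linear two-node pipeline.
def clean(text):
    return text.strip().lower()


def count_words(text):
    return len(text.split())


def run_pipeline(lines, batch_size=2, workers=4):
    """Compose the nodes as nested maps: count_words(clean(item)) for each item."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = parallel_map(clean, lines, pool, batch_size)
        counts = parallel_map(count_words, cleaned, pool, batch_size)
        # Nothing is computed until the outermost map is consumed here.
        return list(counts)


if __name__ == "__main__":
    print(run_pipeline(["  Hello World ", "PaPy", " a b c "]))
```

In this sketch, a larger `batch_size` exposes more items to the pool at once (more parallelism, more memory), while a smaller one keeps evaluation closer to fully lazy, which mirrors the trade-off described in the abstract.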
Cite
@article{arxiv.1407.4378,
title = {PaPy: Parallel and Distributed Data-processing Pipelines in Python},
author = {Marcin Cieslik and Cameron Mura},
journal = {arXiv preprint arXiv:1407.4378},
year = {2014}
}
Comments
7 pages, 5 figures, 2 tables, some use-cases; more at http://muralab.org/PaPy