Related papers: Pipeflow: An Efficient Task-Parallel Pipeline Prog…

Concurrent CPU-GPU Task Programming using Modern C++

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-17 Tsung-Wei Huang, Yibo Lin

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-08 Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, Yibo Lin

FastFlow tutorial

FastFlow is a structured parallel programming framework targeting shared memory multicores. Its layered design and the optimized implementation of the communication mechanisms used to implement the FastFlow streaming networks provided to…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-04-25 Marco Aldinucci, Marco Danelutto, Massimo Torquati

Zero Bubble Pipeline Parallelism

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-22 Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

Extending TensorFlow's Semantics with Pipelined Execution

TensorFlow is a popular cloud computing framework that targets machine learning applications. It separates the specification of application logic (in a dataflow graph) from the execution of the logic. TensorFlow's native runtime executes…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-27 Sam Whitlock, James Larus, Edouard Bugnion

Accelerating sequential programs using FastFlow and self-offloading

FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-02-26 Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, Massimo Torquati

Pipeline Parallelism with Controllable Memory

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the…

Machine Learning · Computer Science 2024-11-05 Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

Spinning Fast Iterative Data Flows

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk…

Databases · Computer Science 2012-08-02 Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl

FastFlow: Efficient Parallel Streaming Applications on Multi-core

Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers…

Distributed, Parallel, and Cluster Computing · Computer Science 2009-09-10 Marco Aldinucci, Massimo Torquati, Massimiliano Meneghin

Transparent Synchronous Dataflow

Dataflow programming is a popular and convenient programming paradigm in systems modelling, optimisation, and machine learning. It has a number of advantages, for instance the lack of control flow allows computation to be carried out in…

Programming Languages · Computer Science 2021-03-03 Steven W. T. Cheung, Dan R. Ghica, Koko Muroya

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-30 Yongchao He, Bohan Zhao, Zheng Cao

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

Labyrinth: Compiling Imperative Control Flow to Parallel Dataflows

Parallel dataflow systems have become a standard technology for large-scale data analytics. Complex data analysis programs in areas such as machine learning and graph analytics often involve control flow, i.e., iterations and branching.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-16 Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Loránd Madai-Tahy, Volker Markl

OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training

Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Hongpei Li, Han Zhang, Huikang Liu, Dongdong Ge, Yinyu Ye

StreamFlow: cross-breeding cloud with HPC

Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Iacopo Colonnelli, Barbara Cantalupo, Ivan Merelli, Marco Aldinucci

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-10 Joel Lamy-Poirier

Closing the Performance Gap with Modern C++

On the way to Exascale, programmers face the increasing challenge of having to support multiple hardware architectures from the same code base. At the same time, portability of code and performance are increasingly difficult to achieve as…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Thomas Heller, Hartmut Kaiser, Patrick Diehl, Dietmar Fey, Marc Alexander Schweitzer

TaskUniVerse: A Task-Based Unified Interface for Versatile Parallel Execution

Task based parallel programming has shown competitive outcomes in many aspects of parallel programming such as efficiency, performance, productivity and scalability. Different approaches are used by different software development frameworks…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-09 Afshin Zafari

StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)

To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be…

Hardware Architecture · Computer Science 2021-07-21 Endri Bezati, Mahyar Emami, Jörn Janneck, James Larus

Bind: a Partitioned Global Workflow Parallel Programming Model

High Performance Computing is notorious for its long and expensive software development cycle. To address this challenge, we present Bind: a "partitioned global workflow" parallel programming model for C++ applications that enables quick…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-16 Alex Kosenkov, Matthias Troyer