Related papers: FastFlow tutorial
FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this…
Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers…
Pipeline is a fundamental parallel programming pattern. Mainstream pipeline programming frameworks count on data abstractions to perform pipeline scheduling. This design is convenient for data-centric pipeline applications but inefficient…
FPGA programming is more complex as compared to Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The coding languages to define the abstraction of Register Transfer Level (RTL) in High Level Synthesis (HLS) for FPGA…
TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML)…
Dataflow programming is a popular and convenient programming paradigm in systems modelling, optimisation, and machine learning. It has a number of advantages, for instance the lacks of control flow allows computation to be carried out in…
To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be…
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of…
Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a…
We present a unified programming model for heterogeneous computing systems. Such systems integrate multiple computing accelerators and memory units to deliver higher performance than CPU-centric systems. Although heterogeneous systems have…
Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of…
Multicore parallel programming has some very difficult problems such as deadlocks during synchronizations and race conditions brought by concurrency. Added to the difficulty is the lack of a simple, well-accepted computing model for…
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous…
Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedding systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high-performance, such potential…
In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient…
In this paper we introduce vFlow - A framework for rapid designing of batch processing applications for Cloud Computing environment. vFlow batch processing system extracts tasks from the vPlans diagrams, systematically captures the dynamics…
Scikit-multiflow is a multi-output/multi-label and stream data mining framework for the Python programming language. Conceived to serve as a platform to encourage democratization of stream learning research, it provides multiple state of…
The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level…
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency…
Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter count and computational demands challenge…