Related papers: FastFlow: Efficient Parallel Streaming Application…
FastFlow is a structured parallel programming framework targeting shared memory multicores. Its layered design and the optimized implementation of the communication mechanisms used to implement the FastFlow streaming networks provided to…
FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this…
FPGA programming is more complex as compared to Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The coding languages to define the abstraction of Register Transfer Level (RTL) in High Level Synthesis (HLS) for FPGA…
TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML)…
Many modern applications require real-time processing of large volumes of high-speed data. Such data processing needs can be modeled as a streaming computation. A streaming computation is specified as a dataflow graph that exposes multiple…
To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be…
Pipeline is a fundamental parallel programming pattern. Mainstream pipeline programming frameworks count on data abstractions to perform pipeline scheduling. This design is convenient for data-centric pipeline applications but inefficient…
Multicore parallel programming has some very difficult problems such as deadlocks during synchronizations and race conditions brought by concurrency. Added to the difficulty is the lack of a simple, well-accepted computing model for…
Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter count and computational demands challenge…
Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…
Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…
Recent data stream processing systems (DSPSs) can achieve excellent performance when processing large volumes of data under tight latency constraints. However, they sacrifice support for concurrent state access that eases the burden of…
Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
In GPU-accelerated data analytics, the overhead of data transfer from CPU to GPU becomes a performance bottleneck when the data scales beyond GPU memory capacity due to the limited PCIe bandwidth. Data compression has come to rescue for…
The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level…
The aim of parallel computing is to increase an application performance by executing the application on multiple processors. OpenMP is an API that supports multi platform shared memory programming model and shared-memory programs are…
Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of…
Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…
There is increasing interest in using multicore processors to accelerate stream processing. For example, indexing sliding window content to enhance the performance of streaming queries is greatly improved by utilizing the computational…