Related papers: StreamBlocks: A compiler for heterogeneous dataflo…
Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…
We present a unified programming model for heterogeneous computing systems. Such systems integrate multiple computing accelerators and memory units to deliver higher performance than CPU-centric systems. Although heterogeneous systems have…
In recent years, heterogeneous computing has emerged as the vital way to increase computers? performance and energy efficiency by combining diverse hardware devices, such as Graphics Processing Units (GPUs) and Field Programmable Gate…
Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…
FPGAs are well-suited for dataflow architectures that process data in a streaming or pipelined manner, thus satisfying the high computational and communication demands of emerging applications. However, manually implementing an efficient…
Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers…
FPGA programming is more complex as compared to Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The coding languages to define the abstraction of Register Transfer Level (RTL) in High Level Synthesis (HLS) for FPGA…
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime…
As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far,…
Big data streaming applications require utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such…
The emerging large-scale and data-hungry algorithms require the computations to be delegated from a central server to several worker nodes. One major challenge in the distributed computations is to tackle delays and failures caused by the…
Region proposal is critical for object detection while it usually poses a bottleneck in improving the computation efficiency on traditional control-flow architectures. We have observed region proposal tasks are potentially suitable for…
FastFlow is a structured parallel programming framework targeting shared memory multicores. Its layered design and the optimized implementation of the communication mechanisms used to implement the FastFlow streaming networks provided to…
Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and…
This paper presents a stream processor generator, called SPGen, for FPGA-based system-on-chip platforms. In our research project, we use an FPGA as a common platform for applications ranging from HPC to embedded/robotics computing.…
The use of FPGAs for efficient graph processing has attracted significant interest. Recent memory subsystem upgrades including the introduction of HBM in FPGAs promise to further alleviate memory bottlenecks. However, modern multi-channel…
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and…
We present DataFlow, a computational framework for building, testing, and deploying high-performance machine learning systems on unbounded time-series data. Traditional data science workflows assume finite datasets and require substantial…
High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in…
Modern computer systems typically conbine multicore CPUs with accelerators like GPUs for inproved performance and energy efficiency. However, these sys- tems suffer from poor performance portability, code tuned for one device must be…