Related papers: MKPipe: A Compiler Framework for Optimizing Multi-…

Exploring Thread Coarsening on FPGA

Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-14 Mostafa Eghbali Zarch , Reece Neff , Michela Becchi

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs

There is a large body of legacy scientific code written in languages like Fortran that is not optimised to get the best performance out of heterogeneous acceleration devices like GPUs and FPGAs, and manually porting such code into parallel…

Performance · Computer Science 2019-01-25 Wim Vanderbauwhede , Syed Waqar Nabi

MCompiler: A Synergistic Compilation Framework

This paper presents a meta-compilation framework, the MCompiler. The main idea is that different segments of a program can be compiled with different compilers/optimizers and combined into a single executable. The MCompiler can be used in a…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-31 Aniket Shivam , Alexandru Nicolau , Alexander V. Veidenbaum

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator…

Hardware Architecture · Computer Science 2017-05-09 Abhishek Kumar Jain , Douglas L. Maskell , Suhaib A. Fahmy

Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments

Current computational systems are heterogeneous by nature, featuring a combination of CPUs and GPUs. As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-23 Fábio Soldado , Fernando Alexandre , Hervé Paulino

ComPar: Optimized Multi-Compiler for Automatic OpenMP S2S Parallelization

Parallelization schemes are essential in order to exploit the full benefits of multi-core architectures. In said architectures, the most comprehensive parallelization API is OpenMP. However, the introduction of correct and optimal OpenMP…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-28 Idan Mosseri , Lee-or Alon , Re'em Harel , Gal Oren

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Nowadays, we are living in an era of extreme device heterogeneity. Despite the high variety of conventional CPU architectures, accelerator devices, such as GPUs and FPGAs, also appear in the foreground exploding the pool of available…

Machine Learning · Computer Science 2022-08-31 Petros Vavaroutsos , Ioannis Oroutzoglou , Dimosthenis Masouros , Dimitrios Soudris

Automatic Multi-GPU Code Generation applied to Simulation of Electrical Machines

The electrical and electronic engineering has used parallel programming to solve its large scale complex problems for performance reasons. However, as parallel programming requires a non-trivial distribution of tasks and data, developers…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-07-05 Antonio Wendell De Oliveira Rodrigues , Frédéric Guyomarc'H , Jean-Luc Dekeyser , Yvonnick Le Menach

An Efficient Dispatcher for Large Scale GraphProcessing on OpenCL-based FPGAs

High parallel framework has been proved to be very suitable for graph processing. There are various work to optimize the implementation in FPGAs, a pipeline parallel device. The key to make use of the parallel performance of FPGAs is to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-02 Chengbo Yang

Improving OpenCL Performance by Specializing Compiler Phase Selection and Ordering

Automatic compiler phase selection/ordering has traditionally been focused on CPUs and, to a lesser extent, FPGAs. We present experiments regarding compiler phase ordering specialization of OpenCL kernels targeting a GPU. We use iterative…

Performance · Computer Science 2018-10-25 Ricardo Nobre , Luís Reis , João M. P. Cardoso

Improving the Efficiency of OpenCL Kernels through Pipes

In an effort to lower the barrier to the adoption of FPGAs by a broader community, today major FPGA vendors offer compiler toolchains for OpenCL code. While using these toolchain allows porting existing code to FPGAs, ensuring performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-09 Mostafa Eghbali Zarch , Michela Becchi

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Guyue Huang , Yang Bai , Liu Liu , Yuke Wang , Bei Yu , Yufei Ding , Yuan Xie

StreamBlocks: A compiler for heterogeneous dataflow computing (technical report)

To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be…

Hardware Architecture · Computer Science 2021-07-21 Endri Bezati , Mahyar Emami , Jörn Janneck , James Larus

Enabling OpenMP Task Parallelism on Multi-FPGAs

FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-23 R. Nepomuceno , R. Sterle , G. Valarini , M. Pereira , H. Yviquel , G. Araujo

Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort.…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-02-10 Michel Steuwer , Christian Fensch , Christophe Dubach

OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices

Medical image processing is often limited by the computational cost of the involved algorithms. Whereas dedicated computing devices (GPUs in particular) exist and do provide significant efficiency boosts, they have an extra cost of use in…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-01 Federico Simmross-Wattenberg , Manuel Rodríguez-Cayetano , Javier Royuela-del-Val , Elena Martín-González , Elisa Moya-Sáez , Marcos Martín-Fernández , Carlos Alberola-López

Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of…

Machine Learning · Computer Science 2020-12-15 Emanuele Parisi , Francesco Barchi , Andrea Bartolini , Giuseppe Tagliavini , Andrea Acquaviva

Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL

As the interest in FPGA-based accelerators for HPC applications increases, new challenges also arise, especially concerning different programming and portability issues. This paper aims to provide a snapshot of the current state of the FPGA…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-06 Manuel de Castro , Francisco J. andújar , Roberto R. Osorio , Rocío Carratalá-Sáez , Diego R. Llanos

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-30 Yongchao He , Bohan Zhao , Zheng Cao

ImageCL: An Image Processing Language for Performance Portability on Heterogeneous Systems

Modern computer systems typically conbine multicore CPUs with accelerators like GPUs for inproved performance and energy efficiency. However, these sys- tems suffer from poor performance portability, code tuned for one device must be…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-23 Thomas L. Falch , Anne C. Elster