Related papers: Temporal Vectorization: A Compiler Approach to Aut…

A DSP shared is a DSP earned: HLS Task-Level Multi-Pumping for High-Performance Low-Resource Designs

High-level synthesis (HLS) enhances digital hardware design productivity through a high abstraction level. Even if the HLS abstraction prevents fine-grained manual register-transfer level (RTL) optimizations, it also enables automatable…

Hardware Architecture · Computer Science 2024-01-01 Giovanni Brignone , Mihai T. Lazarescu , Luciano Lavagno

Memory-constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms

The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task and data-level parallelism to meet throughput requirements…

Signal Processing · Electrical Eng. & Systems 2017-12-01 Shuoxin Lin , Jiahao Wu , Shuvra S. Bhattacharyya

DSL-based Design Space Exploration for Temporal and Spatial Parallelism of Custom Stream Computing

Stream computation is one of the approaches suitable for FPGA-based custom computing due to its high throughput capability brought by pipelining with regular memory access. To increase performance of iterative stream computation, we can…

Hardware Architecture · Computer Science 2015-09-02 Kentaro Sano

Code modernization strategies for short-range non-bonded molecular dynamics simulations

Modern HPC systems are increasingly relying on greater core counts and wider vector registers. Thus, applications need to be adapted to fully utilize these hardware capabilities. One class of applications that can benefit from this increase…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-16 James Vance , Zhen-Hao Xu , Nikita Tretyakov , Torsten Stuehn , Markus Rampp , Sebastian Eibl , Christoph Junghans , André Brinkmann

Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis

Register Transfer Level(RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be…

Software Engineering · Computer Science 2025-07-23 Zhihao Xu , Bixin Li , Lulu Wang

Scalable and Efficient Configuration of Time-Division Multiplexed Resources

Consumer-electronics systems are becoming increasingly complex as the number of integrated applications is growing. Some of these applications have real-time requirements, while other non-real-time applications only require good average…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-28 Anna Minaeva , Premysl Sucha , Benny Akesson , Zdenek Hanzalek

Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems

In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-12 Jose Nunez-Yanez , Mohammad Hosseinabady , Moslem Amiri , Andrés Rodríguez , Rafael Asenjo , Angeles Navarro , Rubén Gran-Tejero , Darío Suárez-Gracia

Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

Maximizing parallelism level in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-29 Zahra Khatami , Hartmut Kaiser , J. Ramanujam

VLCs: Managing Parallelism with Virtualized Libraries

As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-26 Yineng Yan , William Ruys , Hochan Lee , Ian Henriksen , Arthur Peters , Sean Stephens , Bozhi You , Henrique Fingler , Martin Burtscher , Milos Gligoric , Keshav Pingali , Mattan Erez , George Biros , Christopher J. Rossbach

The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability

Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at…

Computational Engineering, Finance, and Science · Computer Science 2016-07-12 Markus Höhnerbach , Ahmed E. Ismail , Paolo Bientinesi

Inner Loop Optimizations in Mapping Single Threaded Programs to Hardware

In the context of mapping high-level algorithms to hardware, we consider the basic problem of generating an efficient hardware implementation of a single threaded program, in particular, that of an inner loop. We describe a control-flow…

Hardware Architecture · Computer Science 2014-11-05 Madhav Desai

Stencil-HMLS: A multi-layered approach to the automatic optimisation of stencil codes on FPGA

The challenges associated with effectively programming FPGAs have been a major blocker in popularising reconfigurable architectures for HPC workloads. However new compiler technologies, such as MLIR, are providing new capabilities which…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-04 Gabriel Rodriguez-Canal , Nick Brown , Maurice Jamieson , Emilien Bauer , Anton Lydike , Tobias Grosser

An LLM-Tool Compiler for Fused Parallel Function Calling

State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional…

Programming Languages · Computer Science 2024-05-29 Simranjit Singh , Andreas Karatzas , Michael Fore , Iraklis Anagnostopoulos , Dimitrios Stamoulis

Optimized Partitioning and Priority Assignment of Real-Time Applications on Heterogeneous Platforms with Hardware Acceleration

Hardware accelerators, such as those based on GPUs and FPGAs, offer an excellent opportunity to efficiently parallelize functionalities. Recently, modern embedded platforms started being equipped with such accelerators, resulting in a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Daniel Casini , Paolo Pazzaglia , Alessandro Biondi , Marco Di Natale

Coordinated Management of Processor Configuration and Cache Partitioning to Optimize Energy under QoS Constraints

An effective way to improve energy efficiency is to throttle hardware resources to meet a certain performance target, specified as a QoS constraint, associated with all applications running on a multicore system. Prior art has proposed…

Hardware Architecture · Computer Science 2019-11-14 Mehrzad Nejat , Madhavan Manivannan , Miquel Pericas , Per Stenstrom

Methods for Partitioning Data to Improve Parallel Execution Time for Sorting on Heterogeneous Clusters

The aim of the paper is to introduce general techniques in order to optimize the parallel execution time of sorting on a distributed architectures with processors of various speeds. Such an application requires a partitioning step. For…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Christophe Cérin , Jean-Christophe Dubacq , Jean-Louis Roch , the SafeScale Collaboration

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

OpenCL for FPGA enables developers to design FPGAs using a programming model similar for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-06 Ji Liu , Abdullah-Al Kafi , Xipeng Shen , Huiyang Zhou

Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-17 Markus Wittmann , Georg Hager , Jan Treibig , Gerhard Wellein

Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory…

Hardware Architecture · Computer Science 2022-11-09 Stephanie Soldavini , Karl F. A. Friebel , Mattia Tibaldi , Gerald Hempel , Jeronimo Castrillon , Christian Pilato

Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the…

Performance · Computer Science 2012-03-01 Markus Wittmann , Georg Hager , Gerhard Wellein