English
Related papers

Related papers: Temporal Vectorization: A Compiler Approach to Aut…

200 papers

High-level synthesis (HLS) enhances digital hardware design productivity through a high abstraction level. Even if the HLS abstraction prevents fine-grained manual register-transfer level (RTL) optimizations, it also enables automatable…

Hardware Architecture · Computer Science 2024-01-01 Giovanni Brignone , Mihai T. Lazarescu , Luciano Lavagno

The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task and data-level parallelism to meet throughput requirements…

Signal Processing · Electrical Eng. & Systems 2017-12-01 Shuoxin Lin , Jiahao Wu , Shuvra S. Bhattacharyya

Stream computation is one of the approaches suitable for FPGA-based custom computing due to its high throughput capability brought by pipelining with regular memory access. To increase performance of iterative stream computation, we can…

Hardware Architecture · Computer Science 2015-09-02 Kentaro Sano

Modern HPC systems are increasingly relying on greater core counts and wider vector registers. Thus, applications need to be adapted to fully utilize these hardware capabilities. One class of applications that can benefit from this increase…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-16 James Vance , Zhen-Hao Xu , Nikita Tretyakov , Torsten Stuehn , Markus Rampp , Sebastian Eibl , Christoph Junghans , André Brinkmann

Register Transfer Level(RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be…

Software Engineering · Computer Science 2025-07-23 Zhihao Xu , Bixin Li , Lulu Wang

Consumer-electronics systems are becoming increasingly complex as the number of integrated applications is growing. Some of these applications have real-time requirements, while other non-real-time applications only require good average…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-28 Anna Minaeva , Premysl Sucha , Benny Akesson , Zdenek Hanzalek

In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-12 Jose Nunez-Yanez , Mohammad Hosseinabady , Moslem Amiri , Andrés Rodríguez , Rafael Asenjo , Angeles Navarro , Rubén Gran-Tejero , Darío Suárez-Gracia

Maximizing parallelism level in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-29 Zahra Khatami , Hartmut Kaiser , J. Ramanujam

As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition…

Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at…

Computational Engineering, Finance, and Science · Computer Science 2016-07-12 Markus Höhnerbach , Ahmed E. Ismail , Paolo Bientinesi

In the context of mapping high-level algorithms to hardware, we consider the basic problem of generating an efficient hardware implementation of a single threaded program, in particular, that of an inner loop. We describe a control-flow…

Hardware Architecture · Computer Science 2014-11-05 Madhav Desai

The challenges associated with effectively programming FPGAs have been a major blocker in popularising reconfigurable architectures for HPC workloads. However new compiler technologies, such as MLIR, are providing new capabilities which…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-04 Gabriel Rodriguez-Canal , Nick Brown , Maurice Jamieson , Emilien Bauer , Anton Lydike , Tobias Grosser

State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional…

Programming Languages · Computer Science 2024-05-29 Simranjit Singh , Andreas Karatzas , Michael Fore , Iraklis Anagnostopoulos , Dimitrios Stamoulis

Hardware accelerators, such as those based on GPUs and FPGAs, offer an excellent opportunity to efficiently parallelize functionalities. Recently, modern embedded platforms started being equipped with such accelerators, resulting in a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Daniel Casini , Paolo Pazzaglia , Alessandro Biondi , Marco Di Natale

An effective way to improve energy efficiency is to throttle hardware resources to meet a certain performance target, specified as a QoS constraint, associated with all applications running on a multicore system. Prior art has proposed…

Hardware Architecture · Computer Science 2019-11-14 Mehrzad Nejat , Madhavan Manivannan , Miquel Pericas , Per Stenstrom

The aim of the paper is to introduce general techniques in order to optimize the parallel execution time of sorting on a distributed architectures with processors of various speeds. Such an application requires a partitioning step. For…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Christophe Cérin , Jean-Christophe Dubacq , Jean-Louis Roch , the SafeScale Collaboration

OpenCL for FPGA enables developers to design FPGAs using a programming model similar for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-06 Ji Liu , Abdullah-Al Kafi , Xipeng Shen , Huiyang Zhou

Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-17 Markus Wittmann , Georg Hager , Jan Treibig , Gerhard Wellein

Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory…

Hardware Architecture · Computer Science 2022-11-09 Stephanie Soldavini , Karl F. A. Friebel , Mattia Tibaldi , Gerald Hempel , Jeronimo Castrillon , Christian Pilato

New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the…

Performance · Computer Science 2012-03-01 Markus Wittmann , Georg Hager , Gerhard Wellein
‹ Prev 1 2 3 10 Next ›