Related papers: Strategy Preserving Compilation for Parallel Funct…

Advanced Programming Platform for efficient use of Data Parallel Hardware

Graphics processing units (GPU) had evolved from a specialized hardware capable to render high quality graphics in games to a commodity hardware for effective processing blocks of data in a parallel schema. This evolution is particularly…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-03-26 Luis Cabellos

Generating Configurable Hardware from Parallel Patterns

In recent years the computing landscape has seen an in- creasing shift towards specialized accelerators. Field pro- grammable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Raghu Prabhakar , David Koeplinger , Kevin Brown , HyoukJoong Lee , Christopher De Sa , Christos Kozyrakis , Kunle Olukotun

GPU Implementation and Optimization of a Flexible MAP Decoder for Synchronization Correction

In this paper we present an optimized parallel implementation of a flexible MAP decoder for synchronization error correcting codes, supporting a very wide range of code sizes and channel conditions. On mid-range GPUs we demonstrate decoding…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Johann A. Briffa

Expression Acceleration: Seamless Parallelization of Typed High-Level Languages

Efficient parallelization of algorithms on general-purpose GPUs is essential in many areas today. However, it is a non-trivial task for software engineers to utilize GPUs to improve the performance of high-level programs in general.…

Programming Languages · Computer Science 2024-07-09 Lars Hummelgren , John Wikman , Oscar Eriksson , Philipp Haller , David Broman

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis , Kamesh Arumugam , Marc Paterno , Desh Ranjan , Balša Terzić , Mohammad Zubair

An Evaluation of Massively Parallel Algorithms for DFA Minimization

We study parallel algorithms for the minimization of Deterministic Finite Automata (DFAs). In particular, we implement four different massively parallel algorithms on Graphics Processing Units (GPUs). Our results confirm the expectations…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-31 Jan Martens , Anton Wijs

Functional design of efficient and parallelizable combinatorial generators using convolution

The application of program transformation and algebraic methods to the development of efficient combinatorial optimization (CO) algorithms relies on an exhaustive combinatorial generator for the problem specification, followed by the fusion…

Discrete Mathematics · Computer Science 2026-05-29 Xi He , Max. A. Little

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs

There is a large body of legacy scientific code written in languages like Fortran that is not optimised to get the best performance out of heterogeneous acceleration devices like GPUs and FPGAs, and manually porting such code into parallel…

Performance · Computer Science 2019-01-25 Wim Vanderbauwhede , Syed Waqar Nabi

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

A Language for Describing Optimization Strategies

Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages - like C or OpenCL - force the programmer to intertwine the code describing…

Programming Languages · Computer Science 2020-02-07 Bastian Hagedorn , Johannes Lenfers , Thomas Koehler , Sergei Gorlatch , Michel Steuwer

pocl: A Performance-Portable OpenCL Implementation

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-23 Pekka Jääskeläinen , Carlos Sánchez de La Lama , Erik Schnetter , Kalle Raiskila , Jarmo Takala , Heikki Berg

Evaluating Massively Parallel Algorithms for DFA Minimisation, Equivalence Checking and Inclusion Checking

We study parallel algorithms for the minimisation and equivalence checking of Deterministic Finite Automata (DFAs). Regarding DFA minimisation, we implement four different massively parallel algorithms on Graphics Processing Units~(GPUs).…

Formal Languages and Automata Theory · Computer Science 2025-08-29 Jan Heemstra , Jan Martens , Anton Wijs

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than todays systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri , Apan Qasem , Martin Burtscher

Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware applications

We present a systematic, algebraically based, design methodology for efficient implementation of computer programs optimized over multiple levels of the processor/memory and network hierarchy. Using a common formalism to describe the…

Mathematical Software · Computer Science 2008-03-18 Lenore R. Mullin , James E. Raynolds

Efficient Tree-Traversals: Reconciling Parallelism and Dense Data Representations

Recent work showed that compiling functional programs to use dense, serialized memory representations for recursive algebraic datatypes can yield significant constant-factor speedups for sequential programs. But serializing data in a…

Programming Languages · Computer Science 2021-07-02 Chaitanya Koparkar , Mike Rainey , Michael Vollmer , Milind Kulkarni , Ryan R. Newton

Parallel Computing Architectures for Robotic Applications: A Comprehensive Review

With the growing complexity and capability of contemporary robotic systems, the necessity of sophisticated computing solutions to efficiently handle tasks such as real-time processing, sensor integration, decision-making, and control…

Robotics · Computer Science 2025-09-09 Md Rafid Islam

Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Performing Retrieval-Augmented Generation (RAG) directly on mobile devices is promising for data privacy and responsiveness but is hindered by the architectural constraints of mobile NPUs. Specifically, current hardware struggles with the…

Computation and Language · Computer Science 2025-12-18 Zhiyang Chen , Daliang Xu , Haiyang Shen , Chiheng Lou , Mengwei Xu , Shangguang Wang , Xin Jin , Yun Ma

High-Performance Parallel Implementation of Genetic Algorithm on FPGA

Genetic Algorithms (GAs) are used to solve search and optimization problems in which an optimal solution can be found using an iterative process with probabilistic and non-deterministic transitions. However, depending on the problem's…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-23 Matheus F. Torquato , Marcelo A. C. Fernandes

A Transformation--Based Approach for the Design of Parallel/Distributed Scientific Software: the FFT

We describe a methodology for designing efficient parallel and distributed scientific software. This methodology utilizes sequences of mechanizable algebra--based optimizing transformations. In this study, we apply our methodology to the…

Software Engineering · Computer Science 2008-11-18 Harry B. Hunt , Lenore R. Mullin , Daniel J. Rosenkrantz , James E. Raynolds

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance…

Programming Languages · Computer Science 2022-07-04 William S. Moses , Ivan R. Ivanov , Jens Domke , Toshio Endo , Johannes Doerfert , Oleksandr Zinenko