Related papers: TREES: A CPU/GPU Task-Parallel Runtime with Explic…
Task parallelism is designed to simplify the task of parallel programming. When executing a task parallel program on modern NUMA architectures, it can fail to scale due to the phenomenon called work inflation, where the overall processing…
A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram,…
This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only…
Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…
We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which…
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and…
Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. Realization of exascale computing using accelerators requires…
Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop…
This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several…
When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance…
Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…
The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. This hardware has been exploited in best-first search algorithms with neural…
Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much…
The kd-tree is a fundamental tool in computer science. Among other applications, the application of kd-tree search (by the tree method) to the fast evaluation of particle interactions and neighbor search is highly important, since the…
We evaluate and compare four contemporary and emerging runtimes for high-performance computing(HPC) applications: Cilk, Charm++, ParalleX and AM++. We compare along three bases: programming model, execution model and the implementation on…
Mixed-Criticality (MC) systems have recently been devised to address the requirements of real-time systems in industrial applications, where the system runs tasks with different criticality levels on a single platform. In some workloads, a…
Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires…
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g.\ OpenCL) do not mandate fair scheduling, and…
Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific…
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks…