Related papers: TREES: A CPU/GPU Task-Parallel Runtime with Explic…

A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle

Task parallelism is designed to simplify the task of parallel programming. When executing a task parallel program on modern NUMA architectures, it can fail to scale due to the phenomenon called work inflation, where the overall processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-08 Justin Deters, Jiaye Wu, Yifan Xu, I-Ting Angelina Lee

Fast Merge Tree Computation via SYCL

A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram,…

Computational Geometry · Computer Science 2023-01-31 Arnur Nigmetov, Dmitriy Morozov

Parallel scheduling of task trees with limited memory

This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-02 Lionel Eyraud-Dubois, Loris Marchal, Oliver Sinnen, Frédéric Vivien

Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-03 Denis Los, Igor Petushkov

Parallelizing Maximal Clique Enumeration on GPUs

We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-25 Mohammad Almasri, Yen-Hsiang Chang, Izzat El Hajj, Rakesh Nagi, Jinjun Xiong, Wen-mei Hwu

Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems

In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-12 Jose Nunez-Yanez, Mohammad Hosseinabady, Moslem Amiri, Andrés Rodríguez, Rafael Asenjo, Angeles Navarro, Rubén Gran-Tejero, Darío Suárez-Gracia

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels

Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. Realization of exascale computing using accelerators requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-06 Nilanjan Goswami, Amer Qouneh, Chao Li, Tao Li

Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform

Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-25 Yuan Meng, Rajgopal Kannan, Viktor Prasanna

Co-Scheduling Algorithms for High-Throughput Workload Execution

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several…

Data Structures and Algorithms · Computer Science 2013-05-01 Guillaume Aupy, Manu Shantharam, Anne Benoit, Yves Robert, Padma Raghavan

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance…

Performance · Computer Science 2020-06-30 Philip Heinisch, Katharina Ostaszewski, Hendrik Ranocha

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

A Parallel CPU-GPU Framework for Batching Heuristic Operations in Depth-First Heuristic Search

The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. This hardware has been exploited in best-first search algorithms with neural…

Artificial Intelligence · Computer Science 2025-11-18 Ehsan Futuhi, Nathan R. Sturtevant

Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks

Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-31 Yidi Wang, Cong Liu, Daniel Wong, Hyoseung Kim

Implementation of a Parallel Tree Method on a GPU

The kd-tree is a fundamental tool in computer science. Among other applications, the application of kd-tree search (by the tree method) to the fast evaluation of particle interactions and neighbor search is highly important, since the…

Instrumentation and Methods for Astrophysics · Physics 2011-12-21 Naohito Nakasato

A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++

We evaluate and compare four contemporary and emerging runtimes for high-performance computing (HPC) applications: Cilk, Charm++, ParalleX and AM++. We compare along three bases: programming model, execution model and the implementation on…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-02 Abhishek Kulkarni, Andrew Lumsdaine

Toward the Design of Fault-Tolerance- and Peak-Power-Aware Multi-Core Mixed-Criticality Systems

Mixed-Criticality (MC) systems have recently been devised to address the requirements of real-time systems in industrial applications, where the system runs tasks with different criticality levels on a single platform. In some workloads, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-01 Behnaz Ranjbar, Ali Hosseinghorban, Mohammad Salehi, Alireza Ejlali, Akash Kumar

GPU First -- Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-27 Shilei Tian, Tom Scogland, Barbara Chapman, Johannes Doerfert

Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)

There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g., OpenCL) do not mandate fair scheduling, and…

Programming Languages · Computer Science 2017-07-10 Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-18 Paul Cardosi, Bérenger Bramas

Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Wenyi Wang, Maxime Gonthier, Poornima Nookala, Haochen Pan, Ian Foster, Ioan Raicu, Kyle Chard