Related papers: Accelerating Concurrent Heap on GPUs

A parallel priority queue with fast updates for GPU architectures

The single-source shortest path (SSSP) problem is a well-studied problem that is used in many applications. In the parallel setting, a work-efficient algorithm that additionally attains $o(n)$ parallel depth has been elusive. Alternatively,…

Data Structures and Algorithms · Computer Science 2023-05-15 Kyle Berney, John Iacono, Ben Karsin, Nodari Sitchinava

Accelerating the Convex Hull Computation with a Parallel GPU Algorithm

The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Alan Keith, Héctor Ferrada, Cristóbal A. Navarro

An Empirical Study of Cache-Oblivious Priority Queues and their Application to the Shortest Path Problem

In recent years the Cache-Oblivious model of external memory computation has provided an attractive theoretical basis for the analysis of algorithms on massive datasets. Much progress has been made in discovering algorithms that are…

Data Structures and Algorithms · Computer Science 2008-02-08 Benjamin Sach, Raphaël Clifford

Gaussian Process Models with Parallelization and GPU acceleration

In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-21 Zhenwen Dai, Andreas Damianou, James Hensman, Neil Lawrence

Performance Comparison on Parallel CPU and GPU Algorithms for Unified Gas-Kinetic Scheme

Parallel algorithms on CPU and GPU are implemented for the Unified Gas-Kinetic Scheme and their performances are investigated and compared by a two dimensional channel flow case. The parallel CPU algorithm has a one dimensional block…

Computational Physics · Physics 2018-11-02 Jizhou Liu, Fang Q. Hu, Xiaodong Li

Towards a Linear-Algebraic Hypervisor

Many techniques in program synthesis, superoptimization, and array programming require parallel rollouts of general-purpose programs. GPUs, while capable targets for domain-specific parallelism, are traditionally underutilized by such…

Programming Languages · Computer Science 2026-04-15 Breandan Considine

Efficient and High-quality Sparse Graph Coloring on the GPU

Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Xuhao Chen, Pingfan Li, Jianbin Fang, Tao Tang, Zhiying Wang, Canqun Yang

Exploring the Limits of GPUs With Parallel Graph Algorithms

In this paper, we explore the limits of graphics processors (GPUs) for general purpose parallel computing by studying problems that require highly irregular data access patterns: parallel graph algorithms for list ranking and connected…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-02-25 Frank Dehne, Kumanan Yogaratnam

GPU-Resident Gaussian Process Regression Leveraging Asynchronous Tasks with HPX

Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library by incorporating a fully GPU-resident GP prediction…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Henrik Möllmann, Dirk Pflüger, Alexander Strack

A Fast and Generic GPU-Based Parallel Reduction Implementation

Reduction operations are extensively employed in many computational problems. A reduction consists of, given a finite set of numeric elements, combining into a single value all elements in that set, using for this a combiner function. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-23 Walid Jradi, Hugo do Nascimento, Wellington Martins

GPU acceleration of an iterative scheme for gas-kinetic model equations with memory reduction techniques

This paper presents a Graphics Processing Units (GPUs) acceleration method of an iterative scheme for gas-kinetic model equations. Unlike the previous GPU parallelization of explicit kinetic schemes, this work features a fast converging…

Computational Physics · Physics 2020-01-08 Lianhua Zhu, Peng Wang, Songze Chen, Zhaoli Guo, Yonghao Zhang

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Using Hierarchical Parallelism to Accelerate the Solution of Many Small Partial Differential Equations

This paper presents efforts to improve the hierarchical parallelism of a two scale simulation code. Two methods to improve the GPU parallel performance were developed and compared. The first used the NVIDIA Multi-Process Service and the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-15 Jacob Merson, Mark S. Shephard

Concurrent Crossover for PDHG

First-order methods based on the PDHG algorithm have recently emerged as a viable option for efficiently solving large-scale linear programming problems. One highly desirable property of these methods is that they can make effective use of…

Optimization and Control · Mathematics 2025-10-29 Edward Rothberg

Parallelizing non-linear sequential models over the sequence length

Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature. For many years this bottleneck has persisted, as many thought…

Machine Learning · Computer Science 2024-01-17 Yi Heng Lim, Qi Zhu, Joshua Selfridge, Muhammad Firmansyah Kasim

GPUs as Storage System Accelerators

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Samer Al-Kiswany, Abdullah Gharaibeh, Matei Ripeanu

A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphical computing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not take advantage yet of GPU parallelism. One reason…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-26 Pierre Talbot, Frédéric Pinel, Pascal Bouvry

Massive Parallelization of Massive Sample-size Survival Analysis

Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, increasing number of patients poses computational challenges when fitting survival…

Computation · Statistics 2023-10-26 Jianxiao Yang, Martijn J. Schuemie, Xiang Ji, Marc A. Suchard

WarpSpeed: A High-Performance Library for Concurrent GPU Hash Tables

GPU hash tables are increasingly used to accelerate data processing, but their limited functionality restricts adoption in large-scale data processing applications. Current limitations include incomplete concurrency support and missing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-24 Hunter McCoy, Prashant Pandey