Related papers: A Fast and Generic GPU-Based Parallel Reduction Im…

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical Integration

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis, Kamesh Arumugam, Marc Paterno, Desh Ranjan, Balša Terzić, Mohammad Zubair

Experimenting with Constraint Programming on GPU

The focus of my PhD thesis is on exploring parallel approaches to efficiently solve problems modeled by constraints and presenting a new proposal. Current solvers are very advanced; they are carefully designed to effectively manage the…

Artificial Intelligence · Computer Science 2019-09-23 Fabio Tardivo

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco, Raimundo Vega, Cristóbal A. Navarro

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphical computing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not yet take advantage of GPU parallelism. One reason…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-26 Pierre Talbot, Frédéric Pinel, Pascal Bouvry

Analysis of GPU Parallel Computing based on Matlab

Matlab is very widely used in scientific computing, but its computational efficiency is lower than that of C programs. To improve computing speed, some toolboxes can use the GPU to accelerate computation. This paper describes…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-26 Mingzhe Wang, Bo Wang, Qiu He, Xiuxiu Liu, Kunshuai Zhu

Parallel Sub-Structuring Methods for solving Sparse Linear Systems on a cluster of GPU

The main objective of this work is to analyze sub-structuring methods for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…

Numerical Analysis · Mathematics 2021-08-31 Abal-Kassim Cheik Ahamed, Frédéric Magoulès

Towards a Linear-Algebraic Hypervisor

Many techniques in program synthesis, superoptimization, and array programming require parallel rollouts of general-purpose programs. GPUs, while capable targets for domain-specific parallelism, are traditionally underutilized by such…

Programming Languages · Computer Science 2026-04-15 Breandan Considine

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

GPU-Acceleration of Parallel Unconditionally Stable Group Explicit Finite Difference Method

Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Once researchers and practitioners noticed the potential of using GPUs for general…

Numerical Analysis · Computer Science 2016-07-12 K. Parand, Saeed Zafarvahedian, Sayyed A. Hossayni

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu, Andrea Lodi

A parallel pattern for iterative stencil + reduce

We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce,…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-16 M. Aldinucci, M. Danelutto, M. Drocco, P. Kilpatrick, C. Misale, G. Peretti Pezzi, M. Torquati

AlSub: Fully Parallel and Modular Subdivision

In recent years, mesh subdivision (the process of forging smooth free-form surfaces from coarse polygonal meshes) has become an indispensable production instrument. Although subdivision performance is crucial during simulation, animation…

Graphics · Computer Science 2019-01-17 Daniel Mlakar, Martin Winter, Hans-Peter Seidel, Markus Steinberger, Rhaleb Zayer

Improving the performance of the linear systems solvers using CUDA

Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Bogdan Oancea, Tudorel Andrei, Raluca Mariana Dragoescu

GPU Tensor Cores for fast Arithmetic Reductions

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-17 Cristóbal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

Gaussian Process Models with Parallelization and GPU acceleration

In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-21 Zhenwen Dai, Andreas Damianou, James Hensman, Neil Lawrence

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-11 Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou, Zhong Zhang, Maxime Robic, Fanghua Jiang

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

Efficient AllReduce with Stragglers

Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the…

Machine Learning · Computer Science 2025-09-30 Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh