Related papers: A Non-linear GPU Thread Map for Triangular Domains

Improving the GPU space of computation under triangular domain problems

There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-08-27 Cristobal A. Navarro, Nancy Hitschfeld

Efficient GPU Thread Mapping on Embedded 2D Fractals

This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map $\lambda: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-29 Cristóbal A. Navarro, Felipe A. Quezada, Nancy Hitschfeld, Raimundo Vega, Benjamin Bustos

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

The study of data-parallel domain re-organization and thread-mapping techniques are relevant topics as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-30 Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Possibilities of Recursive GPU Mapping for Discrete Orthogonal Simplices

The problem of parallel thread mapping is studied for the case of discrete orthogonal $m$-simplices. The possibility of a $O(1)$ time recursive block-space map $\lambda: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ is analyzed from the point of view…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-25 Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-13 Cristóbal A. Navarro, Felipe A. Quezada, Benjamin Bustos, Nancy Hitschfeld, Rolando Kindelan

Block-space GPU Mapping for Embedded Sierpiński Gasket Fractals

This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of $n \times n$. A block-space map $\lambda: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-15 Cristóbal A. Navarro, Benjamín Bustos, Raimundo Vega, Nancy Hitschfeld

Accelerating Compact Fractals with Tensor Core GPUs

This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition behind is to employ two GPU tensor-core accelerated thread…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-26 Felipe A. Quezada, Cristóbal A. Navarro

Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

Mapping parallel threads onto non-box-shaped domains is a known challenge in GPU computing; efficient mapping prevents performance penalties from unnecessary resource allocation. Currently, achieving this requires significant analytical…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Jose Maureira, Cristóbal A. Navarro, Hector Ferrada, Luis Veas-Castillo

Accelerating Deterministic Global Optimization via GPU-parallel Interval Arithmetic

Spatial Branch and Bound (B&B) algorithms are widely used for solving nonconvex problems to global optimality, yet they remain computationally expensive. Though some works have been carried out to speed up B&B via CPU parallelization, GPU…

Optimization and Control · Mathematics 2025-07-29 Hongzhen Zhang, Tim Kerkenhoff, Neil Kichler, Manuel Dahmen, Alexander Mitsos, Uwe Naumann, Dominik Bongartz

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu, Andrea Lodi

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

Thread Batching for High-performance Energy-efficient GPU Memory Design

Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, the memory becomes a bottleneck of GPU's performance and…

Hardware Architecture · Computer Science 2019-06-17 Bing Li, Mengjie Mao, Xiaoxiao Liu, Tao Liu, Zihao Liu, Wujie Wen, Yiran Chen, Hai Li

Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices

Fast domain propagation of linear constraints has become a crucial component of today's best algorithms and solvers for mixed integer programming and pseudo-boolean optimization to achieve peak solving performance. Irregularities in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-26 Boro Sofranac, Ambros Gleixner, Sebastian Pokutta

Optimizing Bloom Filters for Modern GPU Architectures

Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Daniel Jünger, Kevin Kristensen, Yunsong Wang, Xiangyao Yu, Bertil Schmidt

Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs

The reduction of a banded matrix to bidiagonal form is a critical step in the calculation of Singular Values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-14 Evelyne Ringoot, Rabab Alomairy, Alan Edelman

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou, Zhong Zhang, Maxime Robic, Fanghua Jiang

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

The acceleration of sparse matrix computations on modern many-core processors, such as the graphics processing units (GPUs), has been recognized and studied over a decade. Significant performance enhancements have been achieved for many…

Mathematical Software · Computer Science 2017-10-16 Ruipeng Li

Exploring the Limits of GPUs With Parallel Graph Algorithms

In this paper, we explore the limits of graphics processors (GPUs) for general purpose parallel computing by studying problems that require highly irregular data access patterns: parallel graph algorithms for list ranking and connected…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-02-25 Frank Dehne, Kumanan Yogaratnam

Improving GPU Performance Through Resource Sharing

Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The number of thread blocks, and hence…

Hardware Architecture · Computer Science 2015-06-08 Vishwesh Jatala, Jayvant Anantpur, Amey Karkare

GPU-RMQ: Accelerating Range Minimum Queries on Modern GPUs

Range minimum queries are frequently used in string processing and database applications including biological sequence analysis, document retrieval, and web search. Hence, various data structures have been proposed for improving their…

Databases · Computer Science 2026-04-03 Lara Kreis, Justus Henneberg, Valentin Henkys, Felix Schuhknecht, Bertil Schmidt