Related papers: Efficient GPU Thread Mapping on Embedded 2D Fracta…

Block-space GPU Mapping for Embedded Sierpiński Gasket Fractals

This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of $n \times n$. A block-space map $\lambda: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-15 Cristóbal A. Navarro, Benjamín Bustos, Raimundo Vega, Nancy Hitschfeld

Accelerating Compact Fractals with Tensor Core GPUs

This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition behind it is to employ two GPU tensor-core accelerated thread…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-26 Felipe A. Quezada, Cristóbal A. Navarro

A Non-linear GPU Thread Map for Triangular Domains

There is a stage in the GPU computing pipeline where a grid of thread-blocks, in parallel space, is mapped onto the problem domain, in data space. Since the parallel space is restricted to a box type geometry, the mapping…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-07 Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Improving the GPU space of computation under triangular domain problems

There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-08-27 Cristobal A. Navarro, Nancy Hitschfeld

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

This work presents Squeeze, an efficient compact fractal processing scheme for tensor core GPUs. By combining discrete-space transformations between compact and expanded forms, one can do data-parallel computation on a fractal with…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-04 Felipe A. Quezada, Cristóbal A. Navarro, Nancy Hitschfeld, Benjamin Bustos

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-13 Cristóbal A. Navarro, Felipe A. Quezada, Benjamin Bustos, Nancy Hitschfeld, Rolando Kindelan

Possibilities of Recursive GPU Mapping for Discrete Orthogonal Simplices

The problem of parallel thread mapping is studied for the case of discrete orthogonal $m$-simplices. The possibility of an $O(1)$ time recursive block-space map $\lambda: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ is analyzed from the point of view…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-25 Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Data-parallel domain re-organization and thread-mapping techniques are relevant topics of study, as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-30 Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

GPU-Accelerated Algorithms for Process Mapping

Process mapping asks to assign vertices of a task graph to processing elements of a supercomputer such that the computational workload is balanced while the communication cost is minimized. Motivated by the recent success of GPU-based graph…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-16 Petr Samoldekin, Christian Schulz, Henning Woydt

Towards 6D MCL for LiDARs in 3D TSDF Maps on Embedded Systems with GPUs

Monte Carlo Localization is a widely used approach in the field of mobile robotics. While this problem has been well studied in the 2D case, global localization in 3D maps with six degrees of freedom has so far been too computationally…

Robotics · Computer Science 2023-10-09 Marc Eisoldt, Alexander Mock, Mario Porrmann, Thomas Wiemann

Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

Mapping parallel threads onto non-box-shaped domains is a known challenge in GPU computing; efficient mapping prevents performance penalties from unnecessary resource allocation. Currently, achieving this requires significant analytical…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Jose Maureira, Cristóbal A. Navarro, Hector Ferrada, Luis Veas-Castillo

Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs

The reduction of a banded matrix to bidiagonal form is a critical step in the calculation of Singular Values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-14 Evelyne Ringoot, Rabab Alomairy, Alan Edelman

A GPU Based Memory Optimized Parallel Method For FFT Implementation

FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing, digital image processing and so on. However, in application, FFT becomes a factor affecting the processing efficiency, especially…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-25 Fan Zhang, Chen Hu, Qiang Yin, Wei Hu

Thread Batching for High-performance Energy-efficient GPU Memory Design

Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, the memory becomes a bottleneck of GPU's performance and…

Hardware Architecture · Computer Science 2019-06-17 Bing Li, Mengjie Mao, Xiaoxiao Liu, Tao Liu, Zihao Liu, Wujie Wen, Yiran Chen, Hai Li

Optimizing Bloom Filters for Modern GPU Architectures

Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Daniel Jünger, Kevin Kristensen, Yunsong Wang, Xiangyao Yu, Bertil Schmidt

A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem

Branch-and-Bound (B&B) algorithms are time intensive tree-based exploration methods for solving to optimality combinatorial optimization problems. In this paper, we investigate the use of GPU computing as a major complementary way to speed…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-08-21 Melab Nouredine, Imen Chakroun, Mezmaz Mohand, Daniel Tuyttens

gMatch: Fine-Grained and Hardware-Efficient Subgraph Matching on GPUs

Subgraph matching is a core operation in graph analytics, supporting a broad spectrum of applications from social network analysis to bioinformatics. Recent GPU-based approaches accelerate subgraph matching by leveraging parallelism but…

Databases · Computer Science 2026-04-14 Weitian Chen, Shixuan Sun, Cheng Chen, Yongmin Hu, Yingqian Hu, Minyi Guo

GPU Accelerated Fractal Image Compression for Medical Imaging in Parallel Computing Platform

In this paper, we implemented both sequential and parallel versions of fractal image compression algorithms using the CUDA (Compute Unified Device Architecture) programming model for parallelizing the program in Graphics Processing Unit for…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-04 Md. Enamul Haque, Abdullah Al Kaisan, Mahmudur R Saniat, Aminur Rahman

Scalable Graph Embedding Learning On A Single GPU

Graph embedding techniques have attracted growing interest since they convert the graph data into continuous and low-dimensional space. Effective graph analytics provides users a deeper understanding of what is behind the data and thus can…

Machine Learning · Computer Science 2022-01-21 Azita Nouri, Philip E. Davis, Pradeep Subedi, Manish Parashar