Related papers: Efficient Synchronization Primitives for GPUs

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-14 Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, Satoshi Matsuoka

Analysis of Synchronization Mechanisms in Operating Systems

This research analyzed the performance and consistency of four synchronization mechanisms (reentrant locks, semaphores, synchronized methods, and synchronized blocks) across three operating systems: macOS, Windows, and Linux. Synchronization…

Operating Systems · Computer Science 2024-09-18 Oluwatoyin Kode, Temitope Oyemade

Efficient hybrid topology optimization using GPU and homogenization based multigrid approach

We propose a new hybrid topology optimization algorithm based on a multigrid approach that combines the parallelization strategy of the CPU using OpenMP and the heavy multithreading capabilities of modern Graphics Processing Units (GPU). In…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-01 Arya Prakash Padhi, Souvik Chakraborty, Anupam Chakrabarti, Rajib Chowdhury

Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration

Current AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-15 Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva

An efficient implementation of parallel simulated annealing algorithm in GPUs

In this work we propose a highly optimized version of a simulated annealing (SA) algorithm adapted to the more recently developed Graphic Processor Units (GPUs). The programming has been carried out with CUDA toolkit, specially designed for…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-02 A. M. Ferreiro, J. A. García, J. G. López-Salas, C. Vázquez

Primitives for Contract-based Synchronization

We investigate how contracts can be used to regulate the interaction between processes. To do that, we study a variant of the concurrent constraints calculus presented in [1], featuring primitives for multi-party synchronization via…

Programming Languages · Computer Science 2010-10-28 Massimo Bartoletti, Roberto Zunino

A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-15 Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing…

Computer Vision and Pattern Recognition · Computer Science 2022-05-05 Thomas Müller, Alex Evans, Christoph Schied, Alexander Keller

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-22 Basilis Mamalis, Marios Perlitis

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-11 Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

SU(2) Lattice Gauge Theory Simulations on Fermi GPUs

In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an…

High Energy Physics - Lattice · Physics 2015-03-17 Nuno Cardoso, Pedro Bicudo

Specifying and Testing GPU Workgroup Progress Models

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct…

Programming Languages · Computer Science 2021-09-14 Tyler Sorensen, Lucas F. Salvador, Harmit Raval, Hugues Evrard, John Wickerson, Margaret Martonosi, Alastair F. Donaldson

Simulating Lattice Spin Models on Graphics Processing Units

Lattice spin models are useful for studying critical phenomena and allow the extraction of equilibrium and dynamical properties. Simulations of such systems are usually based on Monte Carlo (MC) techniques, and the main difficulty is often…

Computational Physics · Physics 2012-09-13 Tal Levy, Guy Cohen, Eran Rabani

Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform

Cutting-edge embedded system applications, such as self-driving cars and unmanned drone software, are reliant on integrated CPU/GPU platforms for their DNNs-driven workload, such as perception and other highly parallel components. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-20 Soroush Bateni, Zhendong Wang, Yuankun Zhu, Yang Hu, Cong Liu

Efficient GPU Implementation of Particle Interactions with Cutoff Radius and Few Particles per Cell

This paper presents novel approaches to parallelizing particle interactions on a GPU when there are few particles per cell and the interactions are limited by a cutoff distance. The paper surveys classical algorithms and then introduces two…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-25 David Algis, Berenger Bramas, Emmanuelle Darles, Lilian Aveneau

NB-FEB: An Easy-to-Use and Scalable Universal Synchronization Primitive for Parallel Programming

This paper addresses the problem of universal synchronization primitives that can support scalable thread synchronization for large-scale many-core architectures. The universal synchronization primitives that have been deployed widely in…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-11-11 Phuong Hoai Ha, Philippas Tsigas, Otto J. Anshus

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance…

Machine Learning · Computer Science 2026-02-20 Zhuojin Li, Marco Paolieri, Leana Golubchik

GPU Semiring Primitives for Sparse Neighborhood Methods

High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of skewed degree distributions and limits on memory consumption that are typically not issues in dense operations. We demonstrate that a…

Machine Learning · Computer Science 2022-03-08 Corey J. Nolet, Divye Gala, Edward Raff, Joe Eaton, Brad Rees, John Zedlewski, Tim Oates

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical Integration

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis, Kamesh Arumugam, Marc Paterno, Desh Ranjan, Balša Terzić, Mohammad Zubair

Dissecting GPU Memory Hierarchy through Microbenchmarking

Memory access efficiency is a key factor in fully utilizing the computational power of graphics processing units (GPUs). However, many details of the GPU memory hierarchy are not released by GPU vendors. In this paper, we propose a novel…

Hardware Architecture · Computer Science 2016-03-15 Xinxin Mei, Xiaowen Chu