Related papers: Revealing NVIDIA Closed-Source Driver Command Stre…

CUDA Tutorial -- Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA

CUDA (formerly an abbreviation of Compute Unified Device Architecture) is a parallel computing platform and API model created by Nvidia allowing software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose…

Cryptography and Security · Computer Science 2021-09-14 Miroslav Dimitrov, Bernhard Esslinger

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-14 Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, Satoshi Matsuoka

GPGPU Computing

Since the first idea of using GPU to general purpose computing, things have evolved over the years and now there are several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-09 Bogdan Oancea, Tudorel Andrei, Raluca Mariana Dragoescu

GPU implementation of a ray-surface intersection algorithm in CUDA (Compute Unified Device Architecture)

These notes accompany the open-source code published in GitHub which implements a GPU-based line-segment, surface-triangle intersection algorithm in CUDA. It mentions some relevant works and discusses issues specific to this implementation.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-08 Raymond Leung

GPGPU Processing in CUDA Architecture

The future of computation is the Graphical Processing Unit, i.e. the GPU. The promise that the graphics cards have shown in the field of image processing and accelerated rendering of 3D scenes, and the computational capability that these…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-02-21 Jayshree Ghorpade, Jitendra Parande, Madhura Kulkarni, Amit Bawaskar

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs

Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, Ahmad Afsahi

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in supercomputers to mobile phones and tablets. GPUs are used for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz

Implementing CUDA Streams into AstroAccelerate -- A Case Study

To be able to run tasks asynchronously on NVIDIA GPUs a programmer must explicitly implement asynchronous execution in their code using the syntax of CUDA streams. Streams allow a programmer to launch independent concurrent execution tasks,…

Instrumentation and Methods for Astrophysics · Physics 2021-05-07 Jan Novotný, Karel Adámek, Wes Armour

GPU computing for 2-d spin systems: CUDA vs OpenGL

In recent years the more and more powerful GPU's available on the PC market have attracted attention as a cost effective solution for parallel (SIMD) computing. CUDA is a solid evidence of the attention that the major companies are devoting…

High Energy Physics - Lattice · Physics 2010-01-21 Viola Anselmi, Giovanni Conti, Francesco Di Renzo

Supporting CUDA for an extended RISC-V GPU architecture

With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, NVIDIA GPU is the most popular choice…

Programming Languages · Computer Science 2021-09-03 Ruobing Han, Blaise Tine, Jaewon Lee, Jaewoong Sim, Hyesoon Kim

Fast Histograms using Adaptive CUDA Streams

Histograms are widely used in medical imaging, network intrusion detection, packet analysis and other stream-based high throughput applications. However, while porting such software stacks to the GPU, the computation of the histogram is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-11-02 Sisir Koppaka, Dheevatsa Mudigere, Srihari Narasimhan, Babu Narayanan

Descend: A Safe GPU Systems Programming Language

Graphics Processing Units (GPU) offer tremendous computational power by following a throughput oriented computing paradigm where many thousand computational units operate in parallel. Programming this massively parallel hardware is…

Programming Languages · Computer Science 2023-05-08 Bastian Köpcke, Sergei Gorlatch, Michel Steuwer

Analyzing Modern NVIDIA GPU cores

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures…

Hardware Architecture · Computer Science 2025-10-30 Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, Antonio González

CuPBoP: CUDA for Parallelized and Broad-range Processors

CUDA is one of the most popular choices for GPU programming, but it can only be executed on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices not only benefits the hardware community, but also allows data-parallel computation in…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-17 Ruobing Han, Jun Chen, Bhanu Garg, Jeffrey Young, Jaewoong Sim, Hyesoon Kim

Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-25 Robert Lim, Boyana Norris, Allen Malony

CUDA Support in GNA Data Analysis Framework

Usage of GPUs as co-processors is a well-established approach to accelerate costly algorithms operating on matrices and vectors. We aim to further improve the performance of the Global Neutrino Analysis framework (GNA) by adding GPU support…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-23 Anna Fatkina, Maxim Gonchar, Liudmila Kolupaeva, Dmitry Naumov, Konstantin Treskov

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a controlled operator-level study of CUDA…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-30 Huriyeh Babak, Melanie Schaller

Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing

NVIDIA GPU Confidential Computing (GPU-CC) aims to provide secure execution for AI workloads. For end users, enabling GPU-CC is seamless and requires no modifications to existing applications. However, this ease of adoption relies on a…

Cryptography and Security · Computer Science 2026-04-20 Zhongshu Gu, Enriquillo Valdez, Salman Ahmed, Julian James Stephen, Michael Le, Hani Jamjoom, Shixuan Zhao, Zhiqiang Lin

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-25 Twinkle Jain, Gene Cooperman

CUDA Leaks: Information Leakage in GPU Architectures

Graphics Processing Units (GPUs) are deployed on most present server, desktop, and even mobile platforms. Nowadays, a growing number of applications leverage the high parallelism offered by this architecture to speed-up general purpose…

Cryptography and Security · Computer Science 2016-02-29 Roberto Di Pietro, Flavio Lombardi, Antonio Villani