Related papers: Optimising GPGPU Execution Through Runtime Micro-A…

Decoupled Control Flow and Data Access in RISC-V GPGPUs

Vortex, a newly proposed open-source GPGPU platform based on the RISC-V ISA, offers a valid alternative for GPGPU research over the broadly-used modeling platforms based on commercial GPUs. Similarly to the push originating from the RISC-V…

Hardware Architecture · Computer Science 2025-12-02 Giuseppe M. Sarda, Nimish Shah, Abubakr Nada, Debjyoti Bhattacharjee, Marian Verhelst

GPU backed Data Mining on Android Devices

Choosing an appropriate programming paradigm for high-performance computing on low-power devices can be useful to speed up calculations. Many Android devices have an integrated GPU and - although not officially supported - the OpenCL…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-10 Robert Fritze, Claudia Plant

Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of…

Machine Learning · Computer Science 2020-12-15 Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini, Andrea Acquaviva

Prediction of Performance and Power Consumption of GPGPU Applications

Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve an Exascale performance. The main goal of application developers of GPU is to tune their code extensively to obtain optimal performance,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-04 Gargi Alavani, Santonu Sarkar

Power-Capping Metric Evaluation for Improving Energy Efficiency in HPC Applications

With high-performance computing systems now running at exascale, optimizing power-scaling management and resource utilization has become more critical than ever. This paper explores runtime power-capping optimizations that leverage…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-26 Maria Patrou, Thomas Wang, Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, Oscar Hernandez

Preparing for Performance Analysis at Exascale

Performance tools for emerging heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of large-scale executions may record mountains of performance data. Second,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-11 Jonathon Anderson, Yumeng Liu, John Mellor-Crummey

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels

Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insights by…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 E. Wes Bethel, David Camp, Talita Perciano, Colleen Heinemann

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…

Performance · Computer Science 2020-06-22 James D. Stevens, Andreas Klöckner

Dissecting RISC-V Performance: Practical PMU Profiling and Hardware-Agnostic Roofline Analysis on Emerging Platforms

As RISC-V architectures proliferate across embedded and high-performance domains, developers face persistent challenges in performance optimization due to fragmented tooling, immature hardware features, and platform-specific defects. This…

Performance · Computer Science 2025-07-31 Alexander Batashev

Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations

With heterogeneous systems, the number of GPUs per chip increases to provide computational capabilities for solving science at a nanoscopic scale. However, low utilization for single GPUs defies the need to invest more money for expensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-11 Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, Mohammad Zaeed

e-GPU: An Open-Source and Configurable RISC-V Graphic Processing Unit for TinyAI Applications

Graphics processing units (GPUs) excel at parallel processing, but remain largely unexplored in ultra-low-power edge devices (TinyAI) due to their power and area limitations, as well as the lack of suitable programming frameworks. To…

Hardware Architecture · Computer Science 2026-03-17 Simone Machetti, Pasquale Davide Schiavone, Lara Orlandic, Darong Huang, Deniz Kasap, Giovanni Ansaloni, David Atienza

Scalable GPU Performance Variability Analysis framework

Analyzing large-scale performance logs from GPU profilers often requires terabytes of memory and hours of runtime, even for basic summaries. These constraints prevent timely insight and hinder the integration of performance analytics into…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-27 Ankur Lahiry, Ayush Pokharel, Seth Ockerman, Amal Gueroudji, Line Pouchard, Tanzima Z. Islam

Analyzing Modern NVIDIA GPU cores

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures…

Hardware Architecture · Computer Science 2025-10-30 Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, Antonio González

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics

Despite the high computational throughput of GPUs, limited memory capacity and bandwidth-limited CPU-GPU communication via PCIe links remain significant bottlenecks for accelerating large-scale data analytics workloads. This paper…

Databases · Computer Science 2025-02-14 Yichao Yuan, Advait Iyer, Lin Ma, Nishil Talati

GPGPU Performance Estimation with Core and Memory Frequency Scaling

Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) in order to balance computational performance and energy consumption. However, there is still a lack of simple and accurate performance estimation of a given GPU…

Performance · Computer Science 2018-06-14 Qiang Wang, Xiaowen Chu

A Mixed Precision, Multi-GPU Design for Large-scale Top-K Sparse Eigenproblems

Graph analytics techniques based on spectral methods process extremely large sparse matrices with millions or even billions of non-zero values. Behind these algorithms lies the Top-K sparse eigenproblem, the computation of the largest…

Hardware Architecture · Computer Science 2022-01-20 Francesco Sgherzi, Alberto Parravicini, Marco Domenico Santambrogio

Runtime Performances Benchmark for Knowledge Graph Embedding Methods

This paper focuses on characterizing the runtime performance of state-of-the-art implementations of KGE algorithms, in terms of memory footprint and execution time. Despite the rapidly growing interest in KGE…

Machine Learning · Computer Science 2020-11-10 Angelica Sofia Valeriani

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi/many core architectures, a popular choice for an…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-10 Ali TehraniJamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, Ali Jannesari

Dissecting the Graphcore IPU Architecture via Microbenchmarking

This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform recently introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML)…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-10 Zhe Jia, Blake Tillman, Marco Maggioni, Daniele Paolo Scarpazza