Related papers: Automated Programmatic Performance Analysis of Par…

Advanced Python Performance Monitoring with Score-P

Within the last years, Python became more prominent in the scientific community and is now used for simulations, machine learning, and data analysis. All these tasks profit from additional compute power offered by parallelism and…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-26 Andreas Gocht , Robert Schöne , Jan Frenzel

Pipit: Scripting the analysis of parallel execution traces

Performance analysis is a critical step in the oft-repeated, iterative process of performance tuning of parallel programs. Per-process, per-thread traces (detailed logs of events with timestamps) enable in-depth analysis of parallel program…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Abhinav Bhatele , Rakrish Dhakal , Alexander Movsesyan , Aditya K. Ranjan , Onur Cankur

Tools for Analyzing Parallel I/O

Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-19 Julian M. Kunkel , Eugen Betke , Matt Bryson , Philip Carns , Rosemary Francis , Wolfgang Frings , Roland Laifer , Sandra Mendez

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Stephen Mell , David Mell , Konstantinos Kallas , Steve Zdancewic , Osbert Bastani

ScALPEL: A Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring

As supercomputers continue to grow in scale and capabilities, it is becoming increasingly difficult to isolate processor and system level causes of performance degradation. Over the last several years, a significant number of performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2009-03-03 Hari K. Pyla , Bharath Ramesh , Calvin J. Ribbens , Srinidhi Varadarajan

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Ayesha Afzal , Georg Hager , Stefano Markidis , Gerhard Wellein

Proactive bottleneck performance analysis in parallel computing using OpenMP

The aim of parallel computing is to increase an application performance by executing the application on multiple processors. OpenMP is an API that supports multi platform shared memory programming model and shared-memory programs are…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-11-12 Vibha Rajput , Alok Katiyar

Performance Evaluation of Parallel Algorithms

Evaluating how well a whole system or set of subsystems performs is one of the primary objectives of performance testing. We can tell via performance assessment if the architecture implementation meets the design objectives. Performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-15 Donald Ene , Vincent Ike Anireh

Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency

Training large language models (LLMs) efficiently requires a deep understanding of how modern GPU systems behave under real-world distributed training workloads. While prior work has focused primarily on kernel-level performance or…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-10 Marco Kurzynski , Shaizeen Aga , Di Wu

A Study on Performance Analysis Tools for Applications Running on Large Distributed Systems

The evolution of distributed architectures and programming paradigms for performance-oriented program development, challenge the state-of-the-art technology for performance tools. The area of high performance computing is rapidly expanding…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-15 Ajanta De Sarkar , Nandini Mukherjee

THAPI: Tracing Heterogeneous APIs

As we reach exascale, production High Performance Computing (HPC) systems are increasing in complexity. These systems now comprise multiple heterogeneous computing components (CPUs and GPUs) utilized through diverse, often vendor-specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Solomon Bekele , Aurelio Vivas , Thomas Applencourt , Servesh Muralidharan , Bryce Allen , Kazutomo Yoshii , Swann Perarnau , Brice Videau

Different from sequential programs, parallel programs possess their own characteristics which are difficult to analyze in the multi-process or multi-thread environment. This paper presents an innovative method to automatically analyze the…

Distributed, Parallel, and Cluster Computing · Computer Science 2009-06-09 Xu Liu , Jianfeng Zhan , Bibo Tu , Ming Zou , Dan Meng

Preparing for Performance Analysis at Exascale

Performance tools for emerging heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of large-scale executions may record mountains of performance data. Second,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-11 Jonathon Anderson , Yumeng Liu , John Mellor-Crummey

The ELAPS Framework: Experimental Linear Algebra Performance Studies

Optimal use of computing resources requires extensive coding, tuning and benchmarking. To boost developer productivity in these time consuming tasks, we introduce the Experimental Linear Algebra Performance Studies framework (ELAPS), a…

Performance · Computer Science 2015-05-01 Elmar Peise , Paolo Bientinesi

TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks

Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-15 J. Gregory Pauloski , Valerie Hayot-Sasson , Maxime Gonthier , Nathaniel Hudson , Haochen Pan , Sicheng Zhou , Ian Foster , Kyle Chard

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-22 Ankur Lahiry , Ayush Pokharel , Banooqa Banday , Seth Ockerman , Amal Gueroudji , Mohammad Zaeed , Tanzima Z. Islam , Line Pouchard

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

HPAT: High Performance Analytics with Scripting Ease-of-Use

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-12 Ehsan Totoni , Todd A. Anderson , Tatiana Shpeisman

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang , Jianbin Fang , Canqun Yang , Chun Huang , Tao Tang , Zheng Wang

Asynchronous Execution of Python Code on Task Based Runtime Systems

Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and…

Programming Languages · Computer Science 2019-03-08 R. Tohid , Bibek Wagle , Shahrzad Shirzad , Patrick Diehl , Adrian Serio , Alireza Kheirkhahan , Parsa Amini , Katy Williams , Kate Isaacs , Kevin Huck , Steven Brandt , Hartmut Kaiser