Related papers: Automated Programmatic Performance Analysis of Par…
Within the last years, Python became more prominent in the scientific community and is now used for simulations, machine learning, and data analysis. All these tasks profit from additional compute power offered by parallelism and…
Performance analysis is a critical step in the oft-repeated, iterative process of performance tuning of parallel programs. Per-process, per-thread traces (detailed logs of events with timestamps) enable in-depth analysis of parallel program…
Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These…
Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their…
As supercomputers continue to grow in scale and capabilities, it is becoming increasingly difficult to isolate processor and system level causes of performance degradation. Over the last several years, a significant number of performance…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
The aim of parallel computing is to increase an application performance by executing the application on multiple processors. OpenMP is an API that supports multi platform shared memory programming model and shared-memory programs are…
Evaluating how well a whole system or set of subsystems performs is one of the primary objectives of performance testing. We can tell via performance assessment if the architecture implementation meets the design objectives. Performance…
Training large language models (LLMs) efficiently requires a deep understanding of how modern GPU systems behave under real-world distributed training workloads. While prior work has focused primarily on kernel-level performance or…
The evolution of distributed architectures and programming paradigms for performance-oriented program development, challenge the state-of-the-art technology for performance tools. The area of high performance computing is rapidly expanding…
As we reach exascale, production High Performance Computing (HPC) systems are increasing in complexity. These systems now comprise multiple heterogeneous computing components (CPUs and GPUs) utilized through diverse, often vendor-specific…
Different from sequential programs, parallel programs possess their own characteristics which are difficult to analyze in the multi-process or multi-thread environment. This paper presents an innovative method to automatically analyze the…
Performance tools for emerging heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of large-scale executions may record mountains of performance data. Second,…
Optimal use of computing resources requires extensive coding, tuning and benchmarking. To boost developer productivity in these time consuming tasks, we introduce the Experimental Linear Algebra Performance Studies framework (ELAPS), a…
Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a…
Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance…
Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…
Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are…
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…
Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and…