Related papers: Automatic Loop Kernel Analysis and Performance Mod…

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern…

Performance · Computer Science 2018-07-09 Julian Hammer, Jan Eitzinger, Georg Hager, Gerhard Wellein

Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel…

Performance · Computer Science 2020-06-25 Jan Laukemann, Julian Hammer, Georg Hager, Gerhard Wellein

An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-03 Ayesha Afzal, Georg Hager, Gerhard Wellein

Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory…

Computation and Language · Computer Science 2024-11-01 David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar

Introducing a Performance Model for Bandwidth-Limited Loop Kernels

We present a performance model for bandwidth-limited loop kernels which is founded on the analysis of modern cache-based microarchitectures. This model allows an accurate performance prediction and evaluation for existing instruction codes.…

Performance · Computer Science 2009-05-07 Jan Treibig, Georg Hager

A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run…

Performance · Computer Science 2016-04-19 James Stevens, Andreas Klöckner

Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model

Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of…

Performance · Computer Science 2016-01-28 Holger Stengel, Jan Treibig, Georg Hager, Gerhard Wellein

CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

In recent years, HPC systems and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 José Morgado, Leonel Sousa, Aleksandar Ilic

Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel…

Performance · Computer Science 2020-06-25 Francesco Cremonesi, Georg Hager, Gerhard Wellein, Felix Schürmann

OpenCL Performance Prediction using Architecture-Independent Features

OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-04 Beau Johnston, Greg Falzon, Josh Milthorpe

A Model for Circuit Execution Runtime And Its Implications for Quantum Kernels At Practical Data Set Sizes

Quantum machine learning (QML) is a fast-growing discipline within quantum computing. One popular QML algorithm, quantum kernel estimation, uses quantum circuits to estimate a similarity measure (kernel) between two classical feature…

Quantum Physics · Physics 2023-07-12 Travis L. Scholten, Derrick Perry, Joseph Washington, Jennifer R. Glick, Thomas Ward

Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions…

Performance · Computer Science 2020-06-25 Jan Laukemann, Julian Hammer, Johannes Hofmann, Georg Hager, Gerhard Wellein

Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be…

Machine Learning · Computer Science 2026-05-12 Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, Ruqi Zhang

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang

Time-Based Roofline for Deep Learning Performance Analysis

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams

Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems

Edge computing has emerged as a pivotal technology, offering significant advantages such as low latency, enhanced data security, and reduced reliance on centralized cloud infrastructure. These benefits are crucial for applications requiring…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Tomasz Szydlo, Viacheslav Horbanov, Devki Nandan Jha, Shashikant Ilager, Aleksander Slominski, Rajiv Ranjan

Analytical Characterization and Design Space Exploration for Optimization of CNNs

Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimization, including loop tiling and…

Machine Learning · Computer Science 2021-04-13 Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels

Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insights by…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 E. Wes Bethel, David Camp, Talita Perciano, Colleen Heinemann

Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-30 Johannes Hofmann, Christie L. Alappat, Georg Hager, Dietmar Fey, Gerhard Wellein

Telling neuronal apples from oranges: analytical performance modeling of neural tissue simulations

Computational modeling and simulation have become essential tools in the quest to better understand the brain's makeup and to decipher the causal interrelations of its components. The breadth of biochemical and biophysical processes and…

Neurons and Cognition · Quantitative Biology 2019-06-10 Francesco Cremonesi, Felix Schürmann