Related papers: Automatic Loop Kernel Analysis and Performance Mod…

200 papers

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern…

Performance · Computer Science 2018-07-09 Julian Hammer, Jan Eitzinger, Georg Hager, Gerhard Wellein

Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel…

Performance · Computer Science 2020-06-25 Jan Laukemann, Julian Hammer, Georg Hager, Gerhard Wellein

Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-03 Ayesha Afzal, Georg Hager, Gerhard Wellein

Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory…

We present a performance model for bandwidth-limited loop kernels which is founded on the analysis of modern cache-based microarchitectures. This model allows an accurate performance prediction and evaluation for existing instruction codes.…

Performance · Computer Science 2009-05-07 Jan Treibig, Georg Hager

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run…

Performance · Computer Science 2016-04-19 James Stevens, Andreas Klöckner

Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of…

Performance · Computer Science 2016-01-28 Holger Stengel, Jan Treibig, Georg Hager, Gerhard Wellein

In recent years, HPC systems, and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 José Morgado, Leonel Sousa, Aleksandar Ilic

Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel…

Performance · Computer Science 2020-06-25 Francesco Cremonesi, Georg Hager, Gerhard Wellein, Felix Schürmann

OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-04 Beau Johnston, Greg Falzon, Josh Milthorpe

Quantum machine learning (QML) is a fast-growing discipline within quantum computing. One popular QML algorithm, quantum kernel estimation, uses quantum circuits to estimate a similarity measure (kernel) between two classical feature…

Quantum Physics · Physics 2023-07-12 Travis L. Scholten, Derrick Perry, Joseph Washington, Jennifer R. Glick, Thomas Ward

An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions…

Performance · Computer Science 2020-06-25 Jan Laukemann, Julian Hammer, Johannes Hofmann, Georg Hager, Gerhard Wellein

Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be…

Machine Learning · Computer Science 2026-05-12 Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, Ruqi Zhang

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams

Edge computing has emerged as a pivotal technology, offering significant advantages such as low latency, enhanced data security, and reduced reliance on centralized cloud infrastructure. These benefits are crucial for applications requiring…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Tomasz Szydlo, Viacheslav Horbanov, Devki Nandan Jha, Shashikant Ilager, Aleksander Slominski, Rajiv Ranjan

Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimization, including loop tiling and…

Machine Learning · Computer Science 2021-04-13 Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insights by…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 E. Wes Bethel, David Camp, Talita Perciano, Colleen Heinemann

We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-30 Johannes Hofmann, Christie L. Alappat, Georg Hager, Dietmar Fey, Gerhard Wellein

Computational modeling and simulation have become essential tools in the quest to better understand the brain's makeup and to decipher the causal interrelations of its components. The breadth of biochemical and biophysical processes and…

Neurons and Cognition · Quantitative Biology 2019-06-10 Francesco Cremonesi, Felix Schürmann