Related papers: Introducing a Performance Model for Bandwidth-Limi…

Multi-core architectures: Complexities of performance prediction and the impact of cache topology

The balance metric is a simple approach to estimate the performance of bandwidth-limited loop kernels. However, applying the method to in-cache situations and modern multi-core architectures yields unsatisfactory results. This paper…

Performance · Computer Science 2009-10-27 Jan Treibig, Georg Hager, Gerhard Wellein

An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-03 Ayesha Afzal, Georg Hager, Gerhard Wellein

On Performance Modeling for MANETs under General Limited Buffer Constraint

Understanding the real achievable performance of mobile ad hoc networks (MANETs) under practical network constraints is of great importance for their applications in future highly heterogeneous wireless network environments. This paper…

Information Theory · Computer Science 2017-06-02 Jia Liu, Yang Xu, Yulong Shen, Xiaohong Jiang, Tarik Taleb

Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks

This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined is the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements as…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-16 Johannes Hofmann, Dietmar Fey, Jan Eitzinger, Georg Hager, Gerhard Wellein

A Performance Model for Warp Specialization Kernels

This paper presents a performance model tailored for warp specialization kernels, focusing on factors such as warp size, tiling size, input matrix size, memory bandwidth, and thread divergence. Our model offers accurate predictions of…

Programming Languages · Computer Science 2025-06-18 Zhengyang Liu, Vinod Grover

A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run…

Performance · Computer Science 2016-04-19 James Stevens, Andreas Klöckner

Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the…

Performance · Computer Science 2015-11-06 Julian Hammer, Georg Hager, Jan Eitzinger, Gerhard Wellein

On the accuracy and usefulness of analytic energy models for contemporary multicore processors

This paper presents refinements to the execution-cache-memory performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of performance and energy…

Performance · Computer Science 2018-07-09 Johannes Hofmann, Georg Hager, Dietmar Fey

Performance Portability Study of Linear Algebra Kernels in OpenCL

The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations…

Mathematical Software · Computer Science 2022-11-03 Karl Rupp, Philippe Tillet, Florian Rudolf, Josef Weinbub, Tibor Grasser, Ansgar Jüngel

Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-17 Markus Wittmann, Georg Hager, Jan Treibig, Gerhard Wellein

An ECM-based energy-efficiency optimization approach for bandwidth-limited streaming kernels on recent Intel Xeon processors

We investigate an approach that uses low-level analysis and the execution-cache-memory (ECM) performance model in combination with tuning of hardware parameters to lower energy requirements of memory-bound applications. The ECM model is…

Performance · Computer Science 2016-09-13 Johannes Hofmann, Dietmar Fey

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are…

Performance · Computer Science 2020-06-25 Christie L. Alappat, Johannes Hofmann, Georg Hager, Holger Fehske, Alan R. Bishop, Gerhard Wellein

Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels

Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insights by…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 E. Wes Bethel, David Camp, Talita Perciano, Colleen Heinemann

CAWL: A Cache-aware Write Performance Model of Linux Systems

The performance of data intensive applications is often dominated by their input/output (I/O) operations but the I/O stack of systems is complex and severely depends on system specific settings and hardware components. This situation makes…

Performance · Computer Science 2023-06-12 Masoud Gholami, Florian Schintke

Streaming-capable High-performance Architecture of Learned Image Compression Codecs

Learned image compression allows achieving state-of-the-art accuracy and compression ratios, but their relatively slow runtime performance limits their usage. While previous attempts on optimizing learned image codecs focused more on the…

Image and Video Processing · Electrical Eng. & Systems 2022-08-04 Fangzheng Lin, Heming Sun, Jiro Katto

Performance of Cache Memory Subsystems for Multicore Architectures

Advancements in multi-core have created interest among many research groups in finding out ways to harness the true power of processor cores. Recent research suggests that on-board component such as cache memory plays a crucial role in…

Hardware Architecture · Computer Science 2011-11-15 N. Ramasubramanian, Srinivas V. V., N. Ammasai Gounden

Optimally Scheduling CNN Convolutions for Efficient Memory Access

Embedded inference engines for convolutional networks must be parsimonious in memory bandwidth and buffer sizing to meet power and cost constraints. We present an analytical memory bandwidth model for loop-nest optimization targeting…

Neural and Evolutionary Computing · Computer Science 2019-02-06 Arthur Stoutchinin, Francesco Conti, Luca Benini

Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-30 Johannes Hofmann, Christie L. Alappat, Georg Hager, Dietmar Fey, Gerhard Wellein

Estimate The Efficiency Of Multiprocessor's Cash Memory Work Algorithms

Many computer systems for calculating the proper organization of memory are among the most critical issues. Using a tier cache memory (along with branching prediction) is an effective means of increasing modern multi-core processors'…

Networking and Internet Architecture · Computer Science 2021-05-21 Mohamed A. Hamada, Abdelrahman Abdallah

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang