Related papers: Introducing a Performance Model for Bandwidth-Limi…
The balance metric is a simple approach to estimate the performance of bandwidth-limited loop kernels. However, applying the method to in-cache situations and modern multi-core architectures yields unsatisfactory results. This paper…
Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or…
Understanding the real achievable performance of mobile ad hoc networks (MANETs) under practical network constraints is of great importance for their applications in future highly heterogeneous wireless network environments. This paper…
This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements as…
This paper presents a performance model tailored for warp specialization kernels, focusing on factors such as warp size, tiling size, input matrix size, memory bandwidth, and thread divergence. Our model offers accurate predictions of…
We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run…
Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the…
This paper presents refinements to the execution-cache-memory performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of performance and energy…
The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations…
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…
We investigate an approach that uses low-level analysis and the execution-cache-memory (ECM) performance model in combination with tuning of hardware parameters to lower energy requirements of memory-bound applications. The ECM model is…
Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are…
Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insights by…
The performance of data intensive applications is often dominated by their input/output (I/O) operations but the I/O stack of systems is complex and severely depends on system specific settings and hardware components. This situation makes…
Learned image compression achieves state-of-the-art accuracy and compression ratios, but its relatively slow runtime performance limits its usage. While previous attempts at optimizing learned image codecs focused more on the…
Advancements in multi-core processors have created interest among many research groups in finding ways to harness the true power of processor cores. Recent research suggests that on-board components such as cache memory play a crucial role in…
Embedded inference engines for convolutional networks must be parsimonious in memory bandwidth and buffer sizing to meet power and cost constraints. We present an analytical memory bandwidth model for loop-nest optimization targeting…
We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises…
For many computer systems, the proper organization of memory is among the most critical issues. Using tiered cache memory (along with branch prediction) is an effective means of increasing modern multi-core processors'…
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…