Related papers: Execution-Cache-Memory Performance Model: Introduc…
Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of…
This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined is the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements as…
Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory and hence relieve the pressure due to the memory wall. However, determining whether a given workload can…
We investigate an approach that uses low-level analysis and the execution-cache-memory (ECM) performance model in combination with tuning of hardware parameters to lower energy requirements of memory-bound applications. The ECM model is…
This paper presents refinements to the execution-cache-memory performance model and a previously published power model for multicore processors. The combination of both enables a very accurate prediction of performance and energy…
Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with…
State-of-art NPUs are typically architected as a self-contained sub-system with multiple heterogeneous hardware computing modules, and a dataflow-driven programming model. There lacks well-established methodology and tools in the industry…
We describe verification techniques for embedded memory systems using efficient memory modeling (EMM), without explicitly modeling each memory bit. We extend our previously proposed approach of EMM in Bounded Model Checking (BMC) for a…
Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI or the kernel interface perf\_event which provide HPM access with some additional features, many…
Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or…
Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are…
CPU simulators are vital for computer architecture research, primarily for estimating performance under different programs. This poses challenges for fast and accurate simulation of modern CPUs, especially in multi-core systems. Modern CPU…
We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises…
Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large…
Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel…
CXLMemSim is a fast, lightweight simulation framework that enables performance characterization of memory systems based on Compute Express Link (CXL) .mem technology. CXL.mem allows disaggregation and pooling of memory to mitigate memory…
The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for…
Memory performance is often the main bottleneck in modern computing systems. In recent years, researchers have attempted to scale the memory wall by leveraging new technology such as CXL, HBM, and in- and near-memory processing. Developers…
Trusted Execution Environments (TEEs), such as Intel Software Guard eXtensions (SGX), are considered as a promising approach to resolve security challenges in clouds. TEEs protect the confidentiality and integrity of application code and…
Homomorphic encryption (HE) allows direct computations on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In this regard, the enormous size of ciphertexts involved in HE…