Related papers: Multi-Strided Access Patterns to Boost Hardware Pr…
Recent hardware acceleration advances have enabled powerful specialized accelerators for finite element computations, spiking neural network inference, and sparse tensor operations. However, existing approaches face fundamental limitations:…
Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent…
We propose an approach to data memory prefetching which augments the standard prefetch buffer with selection criteria based on performance and usage pattern of a given instruction. This approach is built on top of a pattern matching based…
The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ…
This paper investigates hardware-based memory compression designs to increase the memory bandwidth. When lines are compressible, the hardware can store multiple lines in a single memory location, and retrieve all these lines in a single…
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel…
The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications are struggling to fully utilise the multiple levels of…
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based…
With multi-core processors a ubiquitous building block of modern supercomputers, it is now past time to enable applications to embrace these developments in processor design. To achieve exascale performance, applications will need ways of…
In an effort to lower the barrier to the adoption of FPGAs by a broader community, today major FPGA vendors offer compiler toolchains for OpenCL code. While using these toolchain allows porting existing code to FPGAs, ensuring performance…
Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the…
Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share…
High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations…
Modern computer designs support composite prefetching, where multiple individual prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can drastically hurt…
Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the…
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…
Vector processors frequently suffer from inefficient memory accesses, particularly for strided and segment patterns. While coalescing strided accesses is a natural solution, effectively gathering or scattering elements at fixed strides…
Operating systems have historically had to manage only a single type of memory device. The imminent availability of heterogeneous memory devices based on emerging memory technologies confronts the classic single memory model and opens a new…
Hardware specialization is becoming a key enabler of energyefficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands.…
Because of unmatched improvements in CPU performance, memory transfers have become a bottleneck of program execution. As discovered in recent years, this also affects sorting in internal memory. Since partitioning around several pivots…