Related papers: Multi-Strided Access Patterns to Boost Hardware Pr…

Memory-Guided Unified Hardware Accelerator for Mixed-Precision Scientific Computing

Recent hardware acceleration advances have enabled powerful specialized accelerators for finite element computations, spiking neural network inference, and sparse tensor operations. However, existing approaches face fundamental limitations:…

Hardware Architecture · Computer Science 2026-01-09 Chuanzhen Wang , Leo Zhang , Eric Liu

Prefetching in Deep Memory Hierarchies with NVRAM as Main Memory

Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-27 Manel Lurbe , Miguel Avargues , Salvador Petit , Maria E. Gomez , Rui Yang , Guanhao Wang , Julio Sahuquillo

An Approach to Data Prefetching Using 2-Dimensional Selection Criteria

We propose an approach to data memory prefetching which augments the standard prefetch buffer with selection criteria based on performance and usage pattern of a given instruction. This approach is built on top of a pattern matching based…

Hardware Architecture · Computer Science 2015-05-18 Jean Sung , Sebastian Krupa , Andrew Fishberg , Josef Spjut

Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads

The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ…

Hardware Architecture · Computer Science 2020-09-02 Karthik Sankaranarayanan , Chit-Kwan Lin , Gautham Chinya

CRAM: Efficient Hardware-Based Memory Compression for Bandwidth Enhancement

This paper investigates hardware-based memory compression designs to increase the memory bandwidth. When lines are compressible, the hardware can store multiple lines in a single memory location, and retrieve all these lines in a single…

Hardware Architecture · Computer Science 2018-07-23 Vinson Young , Sanjay Kariyappa , Moinuddin K. Qureshi

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel…

Neural and Evolutionary Computing · Computer Science 2023-11-09 Jan Finkbeiner , Thomas Gmeinder , Mark Pupilli , Alexander Titterton , Emre Neftci

Benchmarking mixed-mode PETSc performance on high-performance architectures

The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications are struggling to fully utilise the multiple levels of…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-07-19 Michael Lange , Gerard Gorman , Michele Weiland , Lawrence Mitchell , Xiaohu Guo , James Southern

MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit

Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-16 Yinuo Wang , Tianqi Mao , Lin Gan , Wubing Wan , Zeyu Song , Jiayu Fu , Lanke He , Wenqiang Wang , Zekun Yin , Wei Xue , Guangwen Yang

Mixed-mode implementation of PETSc for scalable linear algebra on multi-core processors

With multi-core processors a ubiquitous building block of modern supercomputers, it is now past time to enable applications to embrace these developments in processor design. To achieve exascale performance, applications will need ways of…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-08-13 Michele Weiland , Lawrence Mitchell , Gerard Gorman , Stephan Kramer , Mark Parsons , James Southern

Improving the Efficiency of OpenCL Kernels through Pipes

In an effort to lower the barrier to the adoption of FPGAs by a broader community, today major FPGA vendors offer compiler toolchains for OpenCL code. While using these toolchain allows porting existing code to FPGAs, ensuring performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-09 Mostafa Eghbali Zarch , Michela Becchi

Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout

Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the…

Hardware Architecture · Computer Science 2022-02-25 Corentin Ferry , Tomofumi Yuki , Steven Derrien , Sanjay Rajopadhye

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-13 Baodi Shan , Mauricio Araya-Polo

Easy Acceleration with Distributed Arrays

High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Jeremy Kepner , Chansup Byun , LaToya Anderson , William Arcand , David Bestor , William Bergeron , Alex Bonn , Daniel Burrill , Vijay Gadepally , Ryan Haney , Michael Houle , Matthew Hubbell , Hayden Jananthan , Michael Jones , Piotr Luszczek , Lauren Milechin , Guillermo Morales , Julie Mullen , Andrew Prout , Albert Reuther , Antonio Rosa , Charles Yee , Peter Michaleas

Lightweight ML-based Runtime Prefetcher Selection on Many-core Platforms

Modern computer designs support composite prefetching, where multiple individual prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can drastically hurt…

Hardware Architecture · Computer Science 2023-07-18 Erika S. Alcorta , Mahesh Madhav , Scott Tetrick , Neeraja J. Yadwadkar , Andreas Gerstlauer

Exploring the Performance Benefit of Hybrid Memory System on HPC Environments

Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-07 Ivy Bo Peng , Roberto Gioiosa , Gokcen Kestor , Erwin Laure , Stefano Markidis

Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-17 Markus Wittmann , Georg Hager , Jan Treibig , Gerhard Wellein

Efficient Architecture for RISC-V Vector Memory Access

Vector processors frequently suffer from inefficient memory accesses, particularly for strided and segment patterns. While coalescing strided accesses is a natural solution, effectively gathering or scattering elements at fixed strides…

Hardware Architecture · Computer Science 2025-04-17 Hongyi Guan , Yichuan Gao , Chenlu Miao , Haoyang Wu , Hang Zhu , Mingfeng Lin , Huayue Liang

On the Applicability of PEBS based Online Memory Access Tracking for Heterogeneous Memory Management at Scale

Operating systems have historically had to manage only a single type of memory device. The imminent availability of heterogeneous memory devices based on emerging memory technologies confronts the classic single memory model and opens a new…

Operating Systems · Computer Science 2020-11-30 Aleix Roca Nonell , Balazs Gerofi , Leonardo Bautista-Gomez , Dominique Martinet , Vicenç Beltran Querol , Yutaka Ishikawa

A Case for Fine-grain Coherence Specialization in Heterogeneous Systems

Hardware specialization is becoming a key enabler of energyefficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands.…

Hardware Architecture · Computer Science 2021-04-26 Johnathan Alsop , Weon Taek Na , Matthew D. Sinclair , Samuel Grayson , Sarita V. Adve

Sesquickselect: One and a half pivots for cache-efficient selection

Because of unmatched improvements in CPU performance, memory transfers have become a bottleneck of program execution. As discovered in recent years, this also affects sorting in internal memory. Since partitioning around several pivots…

Data Structures and Algorithms · Computer Science 2019-05-07 Conrado Martínez , Markus Nebel , Sebastian Wild