Related papers: Fast Stencil-Code Computation on a Wafer-Scale Pro…

Stencil Computations on Cerebras Wafer-Scale Engine

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Elia Belli, Daniele De Sensi

An MLIR Lowering Pipeline for Stencils at Wafer-Scale

The Cerebras Wafer-Scale Engine (WSE) delivers performance at an unprecedented scale of over 900,000 compute units, all connected via a single-wafer on-chip interconnect. Initially designed for AI, the WSE architecture is also well-suited…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Nicolai Stawinoga, David Katz, Anton Lydike, Justs Zarins, Nick Brown, George Bisbas, Tobias Grosser

TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine

The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hundreds of thousands of AI-cores onto a single chip. Whilst this technology has been designed for machine learning workloads, the significant amount of available raw…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-11 Nick Brown, Brandon Echols, Justs Zarins, Tobias Grosser

Beyond 16GB: Out-of-Core Stencil Computations

Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately,…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-27 Istvan Z Reguly, Gihan R Mudalige, Michael B Giles

Massively scalable stencil algorithm

Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with cache based memory hierarchy, due to low re-use of memory accesses. This work shows that…

Mathematical Software · Computer Science 2022-04-11 Mathias Jacquelin, Mauricio Araya-Polo, Jie Meng

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-17 Hamid Reza Zohouri, Artur Podobas, Satoshi Matsuoka

Wafer-Scale Fast Fourier Transforms

We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-26 Marcelo Orenes-Vera, Ilya Sharapov, Robert Schreiber, Mathias Jacquelin, Philippe Vandermersch, Sharan Chetlur

Casper: Accelerating Stencil Computation using Near-cache Processing

Stencil computation is one of the most used kernels in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three unique…

Hardware Architecture · Computer Science 2023-09-07 Alain Denzler, Rahul Bera, Nastaran Hajinazar, Gagandeep Singh, Geraldo F. Oliveira, Juan Gómez-Luna, Onur Mutlu

Accelerating High-Order Stencils on GPUs

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-16 Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Ryuichi Sai, John Mellor-Crummey, Jinfan Xu, Mauricio Araya-Polo

Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System

Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale…

Computational Physics · Physics 2024-12-30 Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros, Michael James, Sivasankaran Rajamanickam

High Performance Computing with FPGAs and OpenCL

In this work we evaluate the potential of FPGAs for accelerating HPC workloads as a more power-efficient alternative to GPUs. Using High-Level Synthesis and a large set of optimization techniques, we show that FPGAs can achieve better…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-17 Hamid Reza Zohouri

A Generic Library for Stencil Computations

In this era of diverse and heterogeneous computer architectures, the programmability issues, such as productivity and portable efficiency, are crucial to software development and algorithm design. One way to approach the problem is to step…

Mathematical Software · Computer Science 2012-07-10 Mauro Bianco, Ugo Varetto

Accelerating GPU-Based Out-of-Core Stencil Computation with On-the-Fly Compression

Stencil computation is an important class of scientific applications that can be efficiently executed by graphics processing units (GPUs). The out-of-core approach helps run large-scale stencil codes that process data with sizes larger than the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-14 Jingcheng Shen, Yifan Wu, Masao Okita, Fumihiko Ino

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Compression-Based Optimizations for Out-of-Core GPU Stencil Computation

An out-of-core stencil computation code handles large data whose size is beyond the capacity of GPU memory. However, such a code requires streaming data to and from the GPU frequently. As a result, data movement between the CPU and GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-26 Jingcheng Shen, Xin Deng, Yifan Wu, Masao Okita, Fumihiko Ino

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems. Developing robust and efficient algorithms suitable for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-16 Manuel Birke, Bobby Philip, Zhen Wang, Mark Berrill

High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication…

Hardware Architecture · Computer Science 2021-01-08 Kamalavasan Kamalakkannan, Gihan R. Mudalige, Istvan Z. Reguly, Suhaib A. Fahmy

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-28 Johannes Pekkilä, Oskar Lappi, Fredrik Robertsén, Maarit J. Korpi-Lagg

A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems for Artificial Intelligence

Cerebras' wafer-scale engine (WSE) technology merges multiple dies on a single wafer. It addresses the challenges of memory bandwidth, latency, and scalability, making it suitable for artificial intelligence. This work evaluates the WSE-3…

Hardware Architecture · Computer Science 2025-03-18 Yudhishthira Kundu, Manroop Kaur, Tripty Wig, Kriti Kumar, Pushpanjali Kumari, Vivek Puri, Manish Arora