Related papers: Massively scalable stencil algorithm

Stencil Computations on Cerebras Wafer-Scale Engine

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Elia Belli, Daniele De Sensi

Beyond 16GB: Out-of-Core Stencil Computations

Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately,…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-27 Istvan Z Reguly, Gihan R Mudalige, Michael B Giles

An MLIR Lowering Pipeline for Stencils at Wafer-Scale

The Cerebras Wafer-Scale Engine (WSE) delivers performance at an unprecedented scale of over 900,000 compute units, all connected via a single-wafer on-chip interconnect. Initially designed for AI, the WSE architecture is also well-suited…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Nicolai Stawinoga, David Katz, Anton Lydike, Justs Zarins, Nick Brown, George Bisbas, Tobias Grosser

Tight Bounds for Low Dimensional Star Stencils in the Parallel External Memory Model

Stencil computations on low dimensional grids are kernels of many scientific applications including finite difference methods used to solve partial differential equations. On typical modern computer architectures, such stencil computations…

Computational Complexity · Computer Science 2015-01-23 Philipp Hupp, Riko Jacob

Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines

Although modern supercomputers are composed of multicore machines, one can find scientists that still execute their legacy applications which were developed to monocore cluster where memory hierarchy is dedicated to a sole core. The main…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-31 Alexandre Sena, Aline Nascimento, Cristina Boeres, Vinod E. F. Rebello, André Bulcão

Fast Stencil-Code Computation on a Wafer-Scale Processor

The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-09 Kamil Rocki, Dirk Van Essendelft, Ilya Sharapov, Robert Schreiber, Michael Morrison, Vladimir Kibardin, Andrey Portnoy, Jean Francois Dietiker, Madhava Syamlal, Michael James

Efficient cache use for stencil operations on structured discretization grids

We derive tight bounds on cache misses for evaluation of explicit stencil operators on structured grids. Our lower bound is based on the isoperimetrical property of the discrete octahedron. Our upper bound is based on good surface to volume…

Performance · Computer Science 2007-05-23 Michael A. Frumkin, Rob F. Van der Wijngaart

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Ryuichi Sai, John Mellor-Crummey, Jinfan Xu, Mauricio Araya-Polo

Casper: Accelerating Stencil Computation using Near-cache Processing

Stencil computation is one of the most used kernels in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three unique…

Hardware Architecture · Computer Science 2023-09-07 Alain Denzler, Rahul Bera, Nastaran Hajinazar, Gagandeep Singh, Geraldo F. Oliveira, Juan Gómez-Luna, Onur Mutlu

Accelerating High-Order Stencils on GPUs

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-16 Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng

Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Stencils represent a class of computational patterns where an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications engaging a significant portion of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-24 Jesmin Jahan Tithi, Fabrizio Petrini, Hongbo Rong, Andrei Valentin, Carl Ebeling

SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

Sparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular…

Computational Engineering, Finance, and Science · Computer Science 2025-07-01 Qi Li, Kun Li, Haozhi Han, Liang Yuan, Junshi Chen, Yunquan Zhang, Yifeng Chen, Hong An, Ting Cao, Mao Yang

Do We Need Tensor Cores for Stencil Computations?

Stencil computation constitutes a cornerstone of scientific computing, serving as a critical kernel in domains ranging from fluid dynamics to weather simulation. While stencil computations are conventionally regarded as memory-bound and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-03 Qiqi Gu, Chenpeng Wu, Heng Shi, Jianguo Yao, Haibing Guan

An Efficient Vectorization Scheme for Stencil Computation

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-19 Kun Li, Liang Yuan, Yunquan Zhang, Yue Yue, Hang Cao, Pengqi Lu

Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS

The key common bottleneck in most stencil codes is data movement, and prior research has shown that improving data locality through optimisations that schedule across loops do particularly well. However, in many large PDE applications it is…

Performance · Computer Science 2017-11-30 Istvan Z Reguly, Gihan R Mudalige, Mike B Giles

Efficient multicore-aware parallelization strategies for iterative stencil computations

Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel…

Performance · Computer Science 2012-03-01 Jan Treibig, Gerhard Wellein, Georg Hager

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-04 Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, Satoshi Matsuoka

Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model

Stencil algorithms on regular lattices appear in many fields of computational science, and much effort has been put into optimized implementations. Such activities are usually not guided by performance models that provide estimates of…

Performance · Computer Science 2016-01-28 Holger Stengel, Jan Treibig, Georg Hager, Gerhard Wellein

Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine

The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hundreds of thousands of AI-cores onto a single chip. Whilst this technology has been designed for machine learning workloads, the significant amount of available raw…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-11 Nick Brown, Brandon Echols, Justs Zarins, Tobias Grosser