Related papers: Efficient Process-to-Node Mapping Algorithms for S…

High-Quality Hierarchical Process Mapping

Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation when processing graphs on a parallel computer. When the topology of a distributed system is known, an important task…

Data Structures and Algorithms · Computer Science · 2020-01-23 · Marcelo Fonseca Faraj, Alexander van der Grinten, Henning Meyerhenke, Jesper Larsson Träff, Christian Schulz

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-05-11 · Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Stencils represent a class of computational patterns where an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications engaging a significant portion of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-24 · Jesmin Jahan Tithi, Fabrizio Petrini, Hongbo Rong, Andrei Valentin, Carl Ebeling

A survey on scheduling and mapping techniques in 3D Network-on-chip

Network-on-Chips (NoCs) have been widely employed in the design of multiprocessor system-on-chips (MPSoCs) as a scalable communication solution. NoCs enable communications between on-chip Intellectual Property (IP) cores and allow those…

Hardware Architecture · Computer Science · 2022-11-07 · Simran Preet Kaur, Manojit Ghose, Ananya Pathak, Rutuja Patole

Persistent and Partitioned MPI for Stencil Communication

Many parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overheads and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-08-20 · Gerald Collom, Jason Burmark, Olga Pearce, Amanda Bienz

Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models,…

Performance · Computer Science · 2020-06-25 · Julian Hornich, Julian Hammer, Georg Hager, Thomas Gruber, Gerhard Wellein

High Performance Network-on-Chips (NoCs) Design: Performance Modeling, Routing Algorithm and Architecture Optimization

With technology scaling down, hundreds or even thousands of processing elements (PEs) can be integrated on a single chip. Network-on-chip (NoC) has been proposed as an efficient solution to handle this distinctive challenge. In this thesis, we…

Other Computer Science · Computer Science · 2014-06-17 · Zhiliang Qian

MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit

Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-16 · Yinuo Wang, Tianqi Mao, Lin Gan, Wubing Wan, Zeyu Song, Jiayu Fu, Lanke He, Wenqiang Wang, Zekun Yin, Wei Xue, Guangwen Yang

Towards a decentralized algorithm for mapping network and computational resources for distributed data-flow computations

Several high-throughput distributed data-processing applications require multi-hop processing of streams of data. These applications include continual processing on data streams originating from a network of sensors, composing a multimedia…

Distributed, Parallel, and Cluster Computing · Computer Science · 2009-03-26 · Shah Asaduzzaman, Muthucumaru Maheswaran

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-09 · Ryuichi Sai, John Mellor-Crummey, Jinfan Xu, Mauricio Araya-Polo

Better Process Mapping and Sparse Quadratic Assignment

Communication and topology aware process mapping is a powerful approach to reduce communication time in parallel applications with known communication patterns on large, distributed memory systems. We address the problem as a quadratic…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-07-23 · Christian Schulz, Jesper Larsson Träff, Konrad von Kirchbach

Shared-Memory Hierarchical Process Mapping

Modern large-scale scientific applications consist of thousands to millions of individual tasks. These tasks involve not only computation but also communication with one another. Typically, the communication pattern between tasks is sparse…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-03 · Christian Schulz, Henning Woydt

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-12 · Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, Torsten Hoefler

Stencil Matrixization

Current architectures are now equipped with matrix computation units designed to enhance AI and high-performance computing applications. Within these architectures, two fundamental instruction types are matrix multiplication and vector…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-04 · Wenxuan Zhao, Liang Yuan, Baicheng Yan, Penghao Ma, Yunquan Zhang, Long Wang, Zhe Wang

A Novel Process Mapping Strategy in Clustered Environments

The number of processing cores available within the computing nodes of recent clustered environments is growing rapidly. Despite this trend, the number of available network interfaces in such computing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2012-07-13 · Mohsen Soryani, Morteza Analoui, Ghobad Zarrinchian

Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines

Although modern supercomputers are composed of multicore machines, some scientists still run legacy applications that were developed for single-core clusters, where the memory hierarchy is dedicated to a single core. The main…

Distributed, Parallel, and Cluster Computing · Computer Science · 2013-10-31 · Alexandre Sena, Aline Nascimento, Cristina Boeres, Vinod E. F. Rebello, André Bulcão

Accelerating High-Order Stencils on GPUs

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-16 · Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng

Mapping Matters: Application Process Mapping on 3-D Processor Topologies

Applications' performance is influenced by the mapping of processes to computing nodes, the frequency and volume of exchanges among processing elements, the network capacity, and the routing protocol. A poor mapping of application processes…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-11 · Jonas H. Müller Korndörfer, Mario Bielert, Laércio L. Pilla, Florina M. Ciorba

An MPI-based Algorithm for Mapping Complex Networks onto Hierarchical Architectures

Processing massive application graphs on distributed memory systems requires mapping the graphs onto the system's processing elements (PEs). This task becomes all the more important when PEs have non-uniform communication costs or the input…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-07-07 · Maria Predari, Charilaos Tzovas, Christian Schulz, Henning Meyerhenke

Fast Stencil Computations using Fast Fourier Transforms

Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping…

Data Structures and Algorithms · Computer Science · 2021-05-17 · Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, Yimin Zhu