Related papers: Coloured and task-based stencil codes

Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the…

Performance · Computer Science 2012-03-01 Markus Wittmann , Georg Hager , Gerhard Wellein

Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-17 Markus Wittmann , Georg Hager , Jan Treibig , Gerhard Wellein

Beyond 16GB: Out-of-Core Stencil Computations

Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately,…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-27 Istvan Z Reguly , Gihan R Mudalige , Michael B Giles

Automatic task-based parallelization of C++ applications by source-to-source transformations

Currently, multi/many-core CPUs are considered standard in most types of computers including, mobile phones, PCs or supercomputers. However, the parallelization of applications as well as refactoring/design of applications for efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-25 Garip Kusoglu , Berenger Bramas , Stephane Genaud

Experiences with task-based programming using cluster nodes as OpenMP devices

Programming a distributed system, such as a cluster, requires extended use of low-level communication libraries and can often become cumbersome and error prone for the average developer. In this work, we consider each node of a cluster as a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-24 Ilias Keftakis , Vassilios V. Dimakopoulos

Parallel Graph Coloring Algorithms for Distributed GPU Environments

Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-02 Ian Bogle , Erik G Boman , Karen D Devine , Sivasankaran Rajamanickam , George M Slota

OpenMP Parallelization of Dynamic Programming and Greedy Algorithms

Multicore has emerged as a typical architecture model since its advent and stands now as a standard. The trend is to increase the number of cores and improve the performance of the memory system. Providing an efficient multicore…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Claude Tadonki

Enabling OpenMP Task Parallelism on Multi-FPGAs

FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-23 R. Nepomuceno , R. Sterle , G. Valarini , M. Pereira , H. Yviquel , G. Araujo

DuctTeip: An efficient programming model for distributed task based parallel computing

Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-14 Afshin Zafari , Elisabeth Larsson , Martin Tillenius

On Distributed Graph Coloring with Iterative Recoloring

Identifying the sets of operations that can be executed simultaneously is an important problem appearing in many parallel applications. By modeling the operations and their interactions as a graph, one can identify the independent…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-07-28 Ahmet Erdem Sarıyüce , Erik Saule , Ümit V. Çatalyürek

An Efficient Vectorization Scheme for Stencil Computation

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-19 Kun Li , Liang Yuan , Yunquan Zhang , Yue Yue , Hang Cao , Pengqi Lu

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Good process-to-compute-node mappings can be decisive for well performing HPC applications. A special, important class of process-to-node mapping problems is the problem of mapping processes that communicate in a sparse stencil pattern to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-21 Sascha Hunold , Konrad von Kirchbach , Markus Lehr , Christian Schulz , Jesper Larsson Träff

Efficient multicore-aware parallelization strategies for iterative stencil computations

Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel…

Performance · Computer Science 2012-03-01 Jan Treibig , Gerhard Wellein , Georg Hager

Multi-threaded Graph Coloring Algorithm for Shared Memory Architecture

In this paper, we present multi-threaded algorithms for graph coloring suitable to the shared memory programming model. We modify an existing algorithm widely used in the literature and prove the correctness of the modified algorithm. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-11 Nandini Singhal , Sathya Peri , Subrahmanyam Kalyanasundaram

Optimizing ccNUMA locality for task-parallel execution under OpenMP and TBB on multicore-based systems

Task parallelism as employed by the OpenMP task construct or some Intel Threading Building Blocks (TBB) components, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-01-04 Markus Wittmann , Georg Hager

Multicore-optimized wavefront diamond blocking for optimizing stencil updates

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-07-28 Tareq Malas , Georg Hager , Hatem Ltaief , Holger Stengel , Gerhard Wellein , David Keyes

Greed is Good: Optimistic Algorithms for Bipartite-Graph Partial Coloring on Multicore Architectures

In parallel computing, a valid graph coloring yields a lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-01-11 Mustafa Kemal Taş , Kamer Kaya , Erik Saule

A Proof of Concept for Optimizing Task Parallelism by Locality Queues

Task parallelism as employed by the OpenMP task construct, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which…

Performance · Computer Science 2009-02-12 Markus Wittmann , Georg Hager

Asynchronous Runtime with Distributed Manager for Task-based Programming Models

Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-09 Jaume Bosch , Carlos Álvarez , Daniel Jiménez-González , Xavier Martorell , Eduard Ayguadé

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems. Developing robust and efficient algorithms suitable for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-16 Manuel Birke , Bobby Philip , Zhen Wang , Mark Berrill