Related papers: OMI4papps: Optimisation, Modelling and Implementat…

Applications of CMOS technology at the ALICE experiment

Monolithic Active Pixel Sensors (MAPS) combine the sensing part and the front-end electronics in the same silicon layer, making use of CMOS technology. Profiting from the progress of this commercial process, MAPS have been undergoing…

Instrumentation and Detectors · Physics 2024-08-06 Domenico Colella

LB4OMP: A Dynamic Load Balancing Library for Multithreaded Applications

Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-29 Jonas H. Müller Korndörfer, Ahmed Eleliemy, Ali Mohammed, Florina M. Ciorba

Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-Core Architectures

Reactive molecular dynamics simulations are computationally demanding. Reaching spatial and temporal scales where interesting scientific phenomena can be observed requires efficient and scalable implementations on modern hardware. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-26 Hasan Metin Aktulga, Christopher Knight, Paul Coffman, Kurt A. O'Hearn, Tzu-Ray Shan, Wei Jiang

Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-03 Denis Los, Igor Petushkov

Optimizing Xeon Phi for Interactive Data Analysis

The Intel Xeon Phi manycore processor is designed to provide high performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving…

Performance · Computer Science 2019-12-03 Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, William Bergeron, Matthew Hubbell, Vijay Gadepally, Michael Houle, Michael Jones, Anne Klein, Lauren Milechin, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther

Efficient Implementations of Molecular Dynamics Simulations for Lennard-Jones Systems

Efficient implementations of the classical molecular dynamics (MD) method for Lennard-Jones particle systems are considered. Not only general algorithms but also techniques that are efficient for some specific CPU architectures are also…

Statistical Mechanics · Physics 2015-03-17 H. Watanabe, M. Suzuki, N. Ito

Coarse-Grain Performance Estimator for Heterogeneous Parallel Computing Architectures like Zynq All-Programmable SoC

Heterogeneous computing is emerging as a mandatory requirement for power-efficient system design. With this aim, modern heterogeneous platforms like Zynq All-Programmable SoC, that integrates ARM-based SMP and programmable logic, have been…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-08-28 Daniel Jiménez-González, Carlos Álvarez, Antonio Filgueras, Xavier Martorell, Jan Langer, Juanjo Noguera, Kees Vissers

A Practical GPU-Accelerated Implementation of Orthogonal Matching Pursuit

Finding the sparsest solution to the underdetermined system $\mathbf{y}=\mathbf{Ax}$, given a tolerance, is known to be NP-hard. Many approximate solutions to this problem exist, and Orthogonal Matching Pursuit (OMP) is one of the most…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Ariel Lubonja, Sebastian Kazmarek Praesius, Trac Duy Tran

A Hybrid Parallelization of AIM for Multi-Core Clusters: Implementation Details and Benchmark Results on Ranger

This paper presents implementation details and empirical results for a hybrid message passing and shared memory parallelization of the adaptive integral method (AIM). AIM is implemented on a (near) petaflop supercomputing cluster of…

Computational Engineering, Finance, and Science · Computer Science 2010-10-08 Fangzhou Wei, Ali E. Yılmaz

Scaling the memory wall using mixed-precision -- HPG-MxP on an exascale machine

Mixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for artificial intelligence (AI) on recent high performance computing (HPC) platforms. A few applications dominated by…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-16 Aditya Kashi, Nicholson Koukpaizan, Hao Lu, Michael Matheson, Sarp Oral, Feiyi Wang

PBBFMM3D: a parallel black-box algorithm for kernel matrix-vector multiplication

Kernel matrix-vector product is ubiquitous in many science and engineering applications. However, a naive method requires $O(N^2)$ operations, which becomes prohibitive for large-scale problems. We introduce a parallel method that provably…

Mathematical Software · Computer Science 2021-04-30 Ruoxi Wang, Chao Chen, Jonghyun Lee, Eric Darve

foap4: Adaptive mesh refinement with OpenACC, MPI, and p4est

GPUs and other accelerators are increasingly used for scientific computing. In the future, we want to add GPU support to parallel adaptive mesh refinement (AMR) codes written in Fortran. To understand which changes are necessary to obtain…

Computational Physics · Physics 2026-05-11 Jannis Teunissen, Héctor R. Olivares Sánchez, Jesse Vos, Leon Oostrum, Johan Hidding, Victor Azizi, Yuhao Zhou, Hao Wu, Adrian Kelly, Olaf Willocx, Chun Xia, Rony Keppens, Oliver Porth

aims-PAX: Parallel Active eXploration for the automated construction of Machine Learning Force Fields

Recent advances in machine learning force fields (MLFF) have significantly extended the reach of atomistic simulations. Continuous progress in this field requires reliable reference datasets, accurate MLFF architectures, and efficient…

Chemical Physics · Physics 2025-10-24 Tobias Henkes, Shubham Sharma, Alexandre Tkatchenko, Mariana Rossi, Igor Poltavskyi

Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming

As fusion energy devices advance, plasma simulations are crucial for reactor design. Our work extends BIT1 hybrid parallelization by integrating MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming. Results show…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-28 Jeremy J. Williams, Felix Liu, Jordy Trilaksono, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, Pratibha Hegde, Marta Garcia-Gasulla, Valentin Seitz, Frank Jenko, Erwin Laure, Stefano Markidis

POAS: A high-performance scheduling framework for exploiting Accelerator Level Parallelism

Heterogeneous computing is becoming mainstream in all scopes. This new era in computer architecture brings a new paradigm called Accelerator Level Parallelism (ALP). In ALP, accelerators are used concurrently to provide unprecedented levels…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-22 Pablo Antonio Martínez, Gregorio Bernabé, Jose Manuel García

Efficient and Accurate Spatial Mixing of Machine Learned Interatomic Potentials for Materials Science

Machine-learned interatomic potentials can offer near first-principles accuracy but are computationally expensive, limiting their application to large-scale molecular dynamics simulations. Inspired by quantum mechanics/molecular mechanics…

Materials Science · Physics 2025-11-21 Fraser Birks, Matthew Nutter, Thomas D Swinburne, James R Kermode

Benchmarking mixed-mode PETSc performance on high-performance architectures

The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications are struggling to fully utilise the multiple levels of…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-07-19 Michael Lange, Gerard Gorman, Michele Weiland, Lawrence Mitchell, Xiaohu Guo, James Southern

Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks

Scientific computing in the exascale era demands increased computational power to solve complex problems across various domains. With the rise of heterogeneous computing architectures the need for vendor-agnostic, performance portability…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-05 Johansell Villalobos, Josef Ruzicka, Silvio Rizzi

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho, Felipe O. Lins e Silva, Daniel Aloise, Samuel Xavier-de-Souza

GraphLab: A New Framework for Parallel Machine Learning

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and…

Machine Learning · Computer Science 2010-06-28 Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein