Related papers: Pretty Fast Analysis: An embarrassingly parallel a…

Parallel Performance of Molecular Dynamics Trajectory Analysis

The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-31 Mahzad Khoshlessan , Ioannis Paraskevakos , Geoffrey C. Fox , Shantenu Jha , Oliver Beckstein

Parallel Binary Code Analysis

Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine…

Performance · Computer Science 2020-05-19 Xiaozhu Meng , Jonathon M. Anderson , John Mellor-Crummey , Mark W. Krentel , Barton P. Miller , Srđan Milaković

Automatic Parallelization of Sequential Programs

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-21 Peter Kraft , Amos Waterland , Daniel Y Fu , Anitha Gollamudi , Shai Szulanski , Margo Seltzer

Automated Programmatic Performance Analysis of Parallel Programs

Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-25 Onur Cankur , Aditya Tomar , Daniel Nichols , Connor Scully-Allison , Katherine E. Isaacs , Abhinav Bhatele

Efficient Parallelization of Short-Range Molecular Dynamics Simulations on Many-Core Systems

This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly…

Computational Physics · Physics 2013-11-20 R. Meyer

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Ayesha Afzal , Georg Hager , Stefano Markidis , Gerhard Wellein

A fast PC algorithm for high dimensional causal discovery with multi-core PCs

Discovering causal relationships from observational data is a crucial problem and it has applications in many research areas. The PC algorithm is the state-of-the-art constraint based method for causal discovery. However, runtime of the PC…

Artificial Intelligence · Computer Science 2016-11-11 Thuc Duy Le , Tao Hoang , Jiuyong Li , Lin Liu , Huawen Liu

Estimating the overlap between dependent computations for automatic parallelization

Researchers working on the automatic parallelization of programs have long known that too much parallelism can be even worse for performance than too little, because spawning a task to be run on another CPU incurs overheads.…

Programming Languages · Computer Science 2011-09-08 Paul Bone , Zoltan Somogyi , Peter Schachte

Parallel Mapper

The construction of Mapper has emerged in the last decade as a powerful and effective topological data analysis tool that approximates and generalizes other topological summaries, such as the Reeb graph, the contour tree, split, and joint…

Computer Vision and Pattern Recognition · Computer Science 2020-09-15 Mustafa Hajij , Basem Assiri , Paul Rosen

HPC optimal parallel communication algorithm for the simulation of fractional-order systems

A parallel numerical simulation algorithm is presented for fractional-order systems involving Caputo-type derivatives, based on the Adams-Bashforth-Moulton (ABM) predictor-corrector scheme. The parallel algorithm is implemented using…

Mathematical Software · Computer Science 2017-10-04 Cosmin Bonchis , Eva Kaslik , Florin Rosu

Lessons learned from a performance analysis and optimization of a multiscale cellular simulation

This work presents a comprehensive performance analysis and optimization of a multiscale agent-based cellular simulation. The optimizations applied are guided by detailed performance analysis and include memory management, load balance, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-28 Marc Clascà , Marta Garcia-Gasulla , Arnau Montagud , Jose Carbonell Caballero , Alfonso Valencia

An Efficient Particle Tracking Algorithm for Large-Scale Parallel Pseudo-Spectral Simulations of Turbulence

Particle tracking in large-scale numerical simulations of turbulent flows presents one of the major bottlenecks in parallel performance and scaling efficiency. Here, we describe a particle tracking algorithm for large-scale parallel…

Fluid Dynamics · Physics 2022-05-31 Cristian C. Lalescu , Bérenger Bramas , Markus Rampp , Michael Wilczek

PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs

The library PRAND for pseudorandom number generation for modern CPUs and GPUs is presented. It contains both single-threaded and multi-threaded realizations of a number of modern and most reliable generators recently proposed and studied in…

Computational Physics · Physics 2014-02-18 L. Yu. Barash , L. N. Shchur

Engineering Supercomputing Platforms for Biomolecular Applications

A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4 and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an…

Biological Physics · Physics 2025-10-14 Robert Welch , Charles Laughton , Oliver Henrich , Tom Burnley , Daniel Cole , Alan Real , Sarah Harris , James Gebbie-Rayet

Efficient Parallel Simulations of Asynchronous Cellular Arrays

A definition for a class of asynchronous cellular arrays is proposed. An example of such asynchrony would be independent Poisson arrivals of cell iterations. The Ising model in the continuous time formulation of Glauber falls into this…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Boris D. Lubachevsky

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

Several methods exist today to accelerate Machine Learning(ML) or Deep-Learning(DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…

Machine Learning · Computer Science 2023-08-23 Srinjoy Das , Lawrence Rauchwerger

Parallelizing the Approximate Minimum Degree Ordering Algorithm: Strategies and Evaluation

The approximate minimum degree algorithm is widely used before numerical factorization to reduce fill-in for sparse matrices. While considerable attention has been given to the numerical factorization process, less focus has been placed on…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Yen-Hsiang Chang , Aydın Buluç , James Demmel

Parallelizing a modern GPU simulator

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Rodrigo Huerta , Antonio González

Bulk-synchronous pseudo-streaming algorithms for many-core accelerators

The bulk-synchronous parallel (BSP) model provides a framework for writing parallel programs with predictable performance. In this paper we extend the BSP model to support what we will call pseudo-streaming algorithms for accelerators. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-24 Jan-Willem Buurlage , Tom Bannink , Abe Wits