Related papers: Pretty Fast Analysis: An embarrassingly parallel a…
The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations…
Binary code analysis is widely used to assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine…
Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…
Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale…
This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
Discovering causal relationships from observational data is a crucial problem and it has applications in many research areas. The PC algorithm is the state-of-the-art constraint based method for causal discovery. However, runtime of the PC…
Researchers working on the automatic parallelization of programs have long known that too much parallelism can be even worse for performance than too little, because spawning a task to be run on another CPU incurs overheads.…
The construction of Mapper has emerged in the last decade as a powerful and effective topological data analysis tool that approximates and generalizes other topological summaries, such as the Reeb graph, the contour tree, split, and joint…
A parallel numerical simulation algorithm is presented for fractional-order systems involving Caputo-type derivatives, based on the Adams-Bashforth-Moulton (ABM) predictor-corrector scheme. The parallel algorithm is implemented using…
This work presents a comprehensive performance analysis and optimization of a multiscale agent-based cellular simulation. The optimizations applied are guided by detailed performance analysis and include memory management, load balance, and…
Particle tracking in large-scale numerical simulations of turbulent flows presents one of the major bottlenecks in parallel performance and scaling efficiency. Here, we describe a particle tracking algorithm for large-scale parallel…
The library PRAND for pseudorandom number generation for modern CPUs and GPUs is presented. It contains both single-threaded and multi-threaded realizations of a number of modern and most reliable generators recently proposed and studied in…
A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4 and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an…
A definition for a class of asynchronous cellular arrays is proposed. An example of such asynchrony would be independent Poisson arrivals of cell iterations. The Ising model in the continuous time formulation of Glauber falls into this…
AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…
Several methods exist today to accelerate Machine Learning(ML) or Deep-Learning(DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…
The approximate minimum degree algorithm is widely used before numerical factorization to reduce fill-in for sparse matrices. While considerable attention has been given to the numerical factorization process, less focus has been placed on…
Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…
The bulk-synchronous parallel (BSP) model provides a framework for writing parallel programs with predictable performance. In this paper we extend the BSP model to support what we will call pseudo-streaming algorithms for accelerators. We…