Related papers: Parallel time integration using Batched BLAS (Basi…

Efficient GPU implementation of randomized SVD and its applications

Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which…

Machine Learning · Computer Science 2024-03-13 Łukasz Struski, Paweł Morkisz, Przemysław Spurek, Samuel Rodriguez Bernabeu, Tomasz Trzciński

Path-accelerated molecular dynamics: Parallel-in-time integration using path integrals

Massively parallel computer architectures create new opportunities for the performance of long-timescale molecular dynamics (MD) simulations. Here, we introduce the path-accelerated molecular dynamics (PAMD) method that takes advantage of…

Computational Physics · Physics 2021-01-11 Jorge L. Rosa-Raíces, Bin Zhang, Thomas F. Miller

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra kernels widely adopted by applications involved with deep learning and scientific computing. The massive and economic computing power brought forth by the…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-20 Linnan Wang, Wei Wu, Jianxiong Xiao, Yi Yang

Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15…

Hardware Architecture · Computer Science 2016-11-29 Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of…

Mathematical Software · Computer Science 2025-12-05 David Jin, Alexis Montoison, Sungho Shin

Optimizing CUDA Code By Kernel Fusion---Application on BLAS

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS

The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these…

Computational Physics · Physics 2020-10-28 Szilárd Páll, Artem Zhmurov, Paul Bauer, Mark Abraham, Magnus Lundborg, Alan Gray, Berk Hess, Erik Lindahl

Performance Comparison on Parallel CPU and GPU Algorithms for Unified Gas-Kinetic Scheme

Parallel algorithms on CPU and GPU are implemented for the Unified Gas-Kinetic Scheme and their performances are investigated and compared by a two dimensional channel flow case. The parallel CPU algorithm has a one dimensional block…

Computational Physics · Physics 2018-11-02 Jizhou Liu, Fang Q. Hu, Xiaodong Li

Parallel Combining: Benefits of Explicit Synchronization

Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-14 Vitaly Aksenov, Petr Kuznetsov, Anatoly Shalyto

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

Arbitrary parallel entangling gates with independent calibration on a trapped ion quantum computer

Parallel processing of information plays a critical role in accelerating computation. This includes quantum computers, where parallel processing of quantum information will play a critical role in practical quantum advantage. Here, we…

Quantum Physics · Physics 2026-04-30 Matthew Diaz, Masoud Mohammadi-Arzanagh, Yingyue Zhu, Mohammad Hafezi, Norbert M. Linke, Alaina M. Green, Arthur Y. Nam

FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-09 Yujia Zhai, Elisabeth Giem, Quan Fan, Kai Zhao, Jinyang Liu, Zizhong Chen

GPU-accelerated factorization sets in numerical semigroups via parallel bounded lexicographic streams

We describe a method for parallelizing the lexicographic enumeration algorithm for the factorization set of an element in a numerical semigroup via bounds. This enables the use of GPU and distributed computing methods. We provide a CUDA…

Commutative Algebra · Mathematics 2024-05-14 Thomas Barron

Parallel in time dynamics with quantum annealers

Recent years have witnessed an unprecedented increase in experiments and hybrid simulations involving quantum computers, in particular quantum annealers. Although quantum supremacy has not been established thus far, there exist a plethora…

Quantum Physics · Physics 2019-12-10 Konrad Jałowiecki, Andrzej Więckowski, Piotr Gawron, Bartłomiej Gardas

POAS: A high-performance scheduling framework for exploiting Accelerator Level Parallelism

Heterogeneous computing is becoming mainstream in all scopes. This new era in computer architecture brings a new paradigm called Accelerator Level Parallelism (ALP). In ALP, accelerators are used concurrently to provide unprecedented levels…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-22 Pablo Antonio Martínez, Gregorio Bernabé, Jose Manuel García

Simultaneous Solving of Batched Linear Programs on a GPU

Linear Programs (LPs) appear in a large number of applications and offloading them to a GPU is viable to gain performance. Existing work on offloading and solving an LP on a GPU suggests that there is performance gain generally on large…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Amit Gurung, Rajarshi Ray

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Gaussian Process Models with Parallelization and GPU acceleration

In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-21 Zhenwen Dai, Andreas Damianou, James Hensman, Neil Lawrence

GPU Parallelization of Policy Iteration RRT#

Sampling-based planning has become a de facto standard for complex robots given its superior ability to rapidly explore high-dimensional configuration spaces. Most existing optimal sampling-based planning algorithms are sequential in nature…

Robotics · Computer Science 2020-09-10 R. Connor Lawson, Linda Wills, Panagiotis Tsiotras

Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems

Combinatorial algorithms such as those that arise in graph analysis, modeling of discrete systems, bioinformatics, and chemistry, are often hard to parallelize. The Combinatorial BLAS library implements key computational primitives for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-29 Ariful Azad, Oguz Selvitopi, Md Taufique Hussain, John R. Gilbert, Aydin Buluc