Related papers: Parallel time integration using Batched BLAS (Basi…

200 papers

Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which…

Machine Learning · Computer Science 2024-03-13 Łukasz Struski, Paweł Morkisz, Przemysław Spurek, Samuel Rodriguez Bernabeu, Tomasz Trzciński

Massively parallel computer architectures create new opportunities for the performance of long-timescale molecular dynamics (MD) simulations. Here, we introduce the path-accelerated molecular dynamics (PAMD) method that takes advantage of…

Computational Physics · Physics 2021-01-11 Jorge L. Rosa-Raíces, Bin Zhang, Thomas F. Miller

Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra kernels widely adopted by applications in deep learning and scientific computing. The massive and economic computing power brought forth by the…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-20 Linnan Wang, Wei Wu, Jianxiong Xiao, Yi Yang

Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15…

Hardware Architecture · Computer Science 2016-11-29 Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of…

Mathematical Software · Computer Science 2025-12-05 David Jin, Alexis Montoison, Sungho Shin

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these…

Computational Physics · Physics 2020-10-28 Szilárd Páll, Artem Zhmurov, Paul Bauer, Mark Abraham, Magnus Lundborg, Alan Gray, Berk Hess, Erik Lindahl

Parallel algorithms on CPU and GPU are implemented for the Unified Gas-Kinetic Scheme, and their performance is investigated and compared on a two-dimensional channel flow case. The parallel CPU algorithm has a one-dimensional block…

Computational Physics · Physics 2018-11-02 Jizhou Liu, Fang Q. Hu, Xiaodong Li

Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-14 Vitaly Aksenov, Petr Kuznetsov, Anatoly Shalyto

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

Parallel processing of information plays a critical role in accelerating computation. This includes quantum computers, where parallel processing of quantum information will play a critical role in practical quantum advantage. Here, we…

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-09 Yujia Zhai, Elisabeth Giem, Quan Fan, Kai Zhao, Jinyang Liu, Zizhong Chen

We describe a method for parallelizing the lexicographic enumeration algorithm for the factorization set of an element in a numerical semigroup via bounds. This enables the use of GPU and distributed computing methods. We provide a CUDA…

Commutative Algebra · Mathematics 2024-05-14 Thomas Barron

Recent years have witnessed an unprecedented increase in experiments and hybrid simulations involving quantum computers, in particular quantum annealers. Although quantum supremacy has not been established thus far, there exist a plethora…

Quantum Physics · Physics 2019-12-10 Konrad Jałowiecki, Andrzej Więckowski, Piotr Gawron, Bartłomiej Gardas

Heterogeneous computing is becoming mainstream in all scopes. This new era in computer architecture brings a new paradigm called Accelerator Level Parallelism (ALP). In ALP, accelerators are used concurrently to provide unprecedented levels…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-22 Pablo Antonio Martínez, Gregorio Bernabé, Jose Manuel García

Linear Programs (LPs) appear in a large number of applications, and offloading them to a GPU is a viable way to gain performance. Existing work on offloading and solving an LP on a GPU suggests that there is performance gain generally on large…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Amit Gurung, Rajarshi Ray

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-21 Zhenwen Dai, Andreas Damianou, James Hensman, Neil Lawrence

Sampling-based planning has become a de facto standard for complex robots given its superior ability to rapidly explore high-dimensional configuration spaces. Most existing optimal sampling-based planning algorithms are sequential in nature…

Robotics · Computer Science 2020-09-10 R. Connor Lawson, Linda Wills, Panagiotis Tsiotras

Combinatorial algorithms such as those that arise in graph analysis, modeling of discrete systems, bioinformatics, and chemistry, are often hard to parallelize. The Combinatorial BLAS library implements key computational primitives for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-29 Ariful Azad, Oguz Selvitopi, Md Taufique Hussain, John R. Gilbert, Aydin Buluc