Related papers: Efficient Interleaved Batch Matrix Solvers for CUD…

A Batched GPU Methodology for Numerical Solutions of Partial Differential Equations

In this paper we present a methodology for data accesses when solving batches of Tridiagonal and Pentadiagonal matrices that all share the same left-hand-side (LHS) matrix. The intended application is to the numerical solution of Partial…

Computational Physics · Physics 2021-07-13 Enda Carroll , Andrew Gloster , Miguel D. Bustamante , Lennon Ó Náraigh

cuPentBatch -- A batched pentadiagonal solver for NVIDIA GPUs

We introduce cuPentBatch -- our own pentadiagonal solver for NVIDIA GPUs. The development of cuPentBatch has been motivated by applications involving numerical solutions of parabolic partial differential equations, which we describe. Our…

Computational Physics · Physics 2019-06-26 Andrew Gloster , Lennon Ó Náraigh , Khang Ee Pang

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of…

Mathematical Software · Computer Science 2025-12-05 David Jin , Alexis Montoison , Sungho Shin

Accelerating the solution of families of shifted linear systems with CUDA

We describe the GPU implementation of shifted or multimass iterative solvers for sparse linear systems of the sort encountered in lattice gauge theory. We provide a generic tool that can be used by those without GPU programming experience…

High Energy Physics - Lattice · Physics 2011-02-16 Richard Galvez , Greg van Anders

Simultaneous Solving of Batched Linear Programs on a GPU

Linear Programs (LPs) appear in a large number of applications and offloading them to a GPU is viable to gain performance. Existing work on offloading and solving an LP on a GPU suggests that there is performance gain generally on large…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Amit Gurung , Rajarshi Ray

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Jonah Ekelund , Stefano Markidis , Ivy Peng

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

The acceleration of sparse matrix computations on modern many-core processors, such as graphics processing units (GPUs), has been recognized and studied for over a decade. Significant performance enhancements have been achieved for many…

Mathematical Software · Computer Science 2017-10-16 Ruipeng Li

Batched First-Order Methods for Parallel LP Solving in MIP

We present a batched first-order method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise…

Optimization and Control · Mathematics 2026-01-30 Nicolas Blin , Stefano Gualandi , Christopher Maes , Andrea Lodi , Bartolomeo Stellato

Efficient GPU implementation of randomized SVD and its applications

Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which…

Machine Learning · Computer Science 2024-03-13 Łukasz Struski , Paweł Morkisz , Przemysław Spurek , Samuel Rodriguez Bernabeu , Tomasz Trzciński

CuLDA_CGS: Solving Large-scale LDA Problems on GPUs

Latent Dirichlet Allocation (LDA) is a popular topic model. Given the fact that the input corpus of LDA algorithms consists of millions to billions of tokens, the LDA training process is very time-consuming, which may prevent the usage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-14 Xiaolong Xie , Yun Liang , Xiuhong Li , Wei Tan

Developing a High Performance Software Library with MPI and CUDA for Matrix Computations

Nowadays, the paradigm of parallel computing is changing. CUDA is now a popular programming model for general-purpose computations on GPUs, and a great number of applications have been ported to CUDA, obtaining speedups of orders of magnitude…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-09 Bogdan Oancea , Tudorel Andrei

Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures

Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-13 Mónica Chillarón , Gregorio Quintana-Ortí , Vicente Vidal , Per-Gunnar Martinsson

Efficient hybrid topology optimization using GPU and homogenization based multigrid approach

We propose a new hybrid topology optimization algorithm based on a multigrid approach that combines CPU parallelization using OpenMP with the heavy multithreading capabilities of modern Graphics Processing Units (GPUs). In…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-01 Arya Prakash Padhi , Souvik Chakraborty , Anupam Chakrabarti , Rajib Chowdhury

cuHALLaR: A GPU Accelerated Low-Rank Augmented Lagrangian Method for Large-Scale Semidefinite Programming

This paper introduces cuHALLaR, a GPU-accelerated implementation of the HALLaR method proposed in Monteiro et al. 2024 for solving large-scale semidefinite programming (SDP) problems. We demonstrate how our Julia-based implementation…

Optimization and Control · Mathematics 2025-10-27 Jacob M. Aguirre , Diego Cifuentes , Vincent Guigues , Renato D. C. Monteiro , Victor Hugo Nascimento , Arnesh Sujanani

Two-Dimensional Batch Linear Programming on the GPU

This paper presents a novel, high-performance, graphical processing unit-based algorithm for efficiently solving two-dimensional linear programs in batches. The domain of two-dimensional linear programs is particularly useful due to the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-14 John Charlton , Steve Maddock , Paul Richmond

Iterative Methods in GPU-Resident Linear Solvers for Nonlinear Constrained Optimization

Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra…

Computational Engineering, Finance, and Science · Computer Science 2024-01-26 Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari , Mudabir Qamar Ansari

Solving Batched Linear Programs on GPU and Multicore CPU

Linear Programs (LPs) appear in a large number of applications and offloading them to the GPU is viable to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained from large sized LPs…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-27 Amit Gurung , Rajarshi Ray

An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for the symmetric tridiagonal matrices based on ScaLAPACK and the hierarchically semiseparable (HSS) matrices. HSS is an important type of rank-structured…

Mathematical Software · Computer Science 2016-12-27 Shengguo Li , Francois-Henry Rouet , Jie Liu , Chun Huang , Xingyu Gao , Xuebin Chi

ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method

This paper presents a heuristic for finding the optimum number of CUDA streams by using tools common to the modern AI-oriented approaches and applied to the parallel partition algorithm. A time complexity model for the GPU realization of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-22 Milena Veneva , Toshiyuki Imamura