English
Related papers

Related papers: Efficient Interleaved Batch Matrix Solvers for CUD…

200 papers

In this paper we present a methodology for data accesses when solving batches of Tridiagonal and Pentadiagonal matrices that all share the same left-hand-side (LHS) matrix. The intended application is to the numerical solution of Partial…

Computational Physics · Physics 2021-07-13 Enda Carroll , Andrew Gloster , Miguel D. Bustamante , Lennon Ó' Náraigh

We introduce cuPentBatch -- our own pentadiagonal solver for NVIDIA GPUs. The development of cuPentBatch has been motivated by applications involving numerical solutions of parabolic partial differential equations, which we describe. Our…

Computational Physics · Physics 2019-06-26 Andrew Gloster , Lennon Ó Náraigh , Khang Ee Pang

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of…

Mathematical Software · Computer Science 2025-12-05 David Jin , Alexis Montoison , Sungho Shin

We describe the GPU implementation of shifted or multimass iterative solvers for sparse linear systems of the sort encountered in lattice gauge theory. We provide a generic tool that can be used by those without GPU programming experience…

High Energy Physics - Lattice · Physics 2011-02-16 Richard Galvez , Greg van Anders

Linear Programs (LPs) appear in a large number of applications and offloading them to a GPU is viable to gain performance. Existing work on offloading and solving an LP on a GPU suggests that there is performance gain generally on large…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Amit Gurung , Rajarshi Ray

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Jonah Ekelund , Stefano Markidis , Ivy Peng

The acceleration of sparse matrix computations on modern many-core processors, such as the graphics processing units (GPUs), has been recognized and studied over a decade. Significant performance enhancements have been achieved for many…

Mathematical Software · Computer Science 2017-10-16 Ruipeng Li

We present a batched first-order method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise…

Optimization and Control · Mathematics 2026-01-30 Nicolas Blin , Stefano Gualandi , Christopher Maes , Andrea Lodi , Bartolomeo Stellato

Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which…

Machine Learning · Computer Science 2024-03-13 Łukasz Struski , Paweł Morkisz , Przemysław Spurek , Samuel Rodriguez Bernabeu , Tomasz Trzciński

Latent Dirichlet Allocation(LDA) is a popular topic model. Given the fact that the input corpus of LDA algorithms consists of millions to billions of tokens, the LDA training process is very time-consuming, which may prevent the usage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-14 Xiaolong Xie , Yun Liang , Xiuhong Li , Wei Tan

Nowadays, the paradigm of parallel computing is changing. CUDA is now a popular programming model for general purpose computations on GPUs and a great number of applications were ported to CUDA obtaining speedups of orders of magnitude…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-09 Bogdan Oancea , Tudorel Andrei

Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-13 Mónica Chillarón , Gregorio Quintana-Ortí , Vicente Vidal , Per-Gunnar Martinsson

We propose a new hybrid topology optimization algorithm based on multigrid approach that combines the parallelization strategy of CPU using OpenMP and heavily multithreading capabilities of modern Graphics Processing Units (GPU). In…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-01 Arya Prakash Padhi , Souvik Chakraborty , Anupam Chakrabarti , Rajib Chowdhury

This paper introduces cuHALLaR, a GPU-accelerated implementation of the HALLaR method proposed in Monteiro et al. 2024 for solving large-scale semidefinite programming (SDP) problems. We demonstrate how our Julia-based implementation…

This paper presents a novel, high-performance, graphical processing unit-based algorithm for efficiently solving two-dimensional linear programs in batches. The domain of two-dimensional linear programs is particularly useful due to the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-14 John Charlton , Steve Maddock , Paul Richmond

Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra…

Computational Engineering, Finance, and Science · Computer Science 2024-01-26 Kasia Świrydowicz , Nicholson Koukpaizan , Maksudul Alam , Shaked Regev , Michael Saunders , Slaven Peleš

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari , Mudabir Qamar Ansari

Linear Programs (LPs) appear in a large number of applications and offloading them to the GPU is viable to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained from large sized LPs…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-27 Amit Gurung , Rajarshi Ray

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for the symmetric tridiagonal matrices based on ScaLAPACK and the hierarchically semiseparable (HSS) matrices. HSS is an important type of rank-structured…

Mathematical Software · Computer Science 2016-12-27 Shengguo Li , Francois-Henry Rouet , Jie Liu , Chun Huang , Xingyu Gao , Xuebin Chi

This paper presents a heuristic for finding the optimum number of CUDA streams by using tools common to the modern AI-oriented approaches and applied to the parallel partition algorithm. A time complexity model for the GPU realization of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-22 Milena Veneva , Toshiyuki Imamura
‹ Prev 1 2 3 10 Next ›