Related papers: A Structure-Aware Irregular Blocking Method for Sp…

GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation

LU factorization for sparse matrices is the most important computing step for many engineering and scientific computing problems such as circuit simulation. But parallelizing LU factorization with the Graphic Processing Units (GPU) still…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-14 Shaoyi Peng , Sheldon X. -D. Tan

Adaptive Block Sparse Regularization under Arbitrary Linear Transform

We propose a convex and fast signal reconstruction method for block sparsity under arbitrary linear transform with unknown block structure. The proposed method is a generalization of the similar existing method and can reconstruct signals…

Machine Learning · Computer Science 2024-02-14 Takanobu Furuhashi , Hidekata Hontani , Tatsuya Yokota

GPU accelerated matrix factorization of large scale data using block based approach

Matrix Factorization (MF) on large scale data takes substantial time on a Central Processing Unit (CPU). While Graphical Processing Unit (GPU)s could expedite the computation of MF, the available memory on a GPU is finite. Leveraging GPUs…

Machine Learning · Computer Science 2023-04-28 Prasad Bhavana , Vineet Padmanabhan

sTiles: An Accelerated Computational Framework for Sparse Factorizations of Structured Matrices

This paper introduces sTiles, a GPU-accelerated framework for factorizing sparse structured symmetric matrices. By leveraging tile algorithms for fine-grained computations, sTiles uses a structure-aware task execution flow to handle…

Performance · Computer Science 2025-01-07 Esmail Abdul Fattah , Hatem Ltaief , Havard Rue , David Keyes

GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs

Decomposing matrix A into a lower matrix L and an upper matrix U, which is also known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-11 Anil Gaihre , Xiaoye S. Li , Hang Liu

Fast and Robust Sparsity-Aware Block Diagonal Representation

The block diagonal structure of an affinity matrix is a commonly desired property in cluster analysis because it represents clusters of feature vectors by non-zero coefficients that are concentrated in blocks. However, recovering a block…

Machine Learning · Computer Science 2023-12-05 Aylin Tastan , Michael Muma , Abdelhak M. Zoubir

A Two-level GPU-Accelerated Incomplete LU Preconditioner for General Sparse Linear Systems

This paper presents a parallel preconditioning approach based on incomplete LU (ILU) factorizations in the framework of Domain Decomposition (DD) for general sparse linear systems. We focus on distributed memory parallel architectures,…

Numerical Analysis · Mathematics 2023-03-17 Tianshi Xu , Ruipeng Li , Daniel Osei-Kuffuor

GPU-Accelerated Cholesky Factorization of Block Tridiagonal Matrices

This paper presents a GPU-accelerated framework for solving block tridiagonal linear systems that arise naturally in numerous real-time applications across engineering and scientific computing. Through a multi-stage permutation strategy…

Optimization and Control · Mathematics 2026-01-08 Roland Schwan , Daniel Kuhn , Colin N. Jones

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-01 Cong Guo , Bo Yang Hsueh , Jingwen Leng , Yuxian Qiu , Yue Guan , Zehuan Wang , Xiaoying Jia , Xipeng Li , Minyi Guo , Yuhao Zhu

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block…

Mathematical Software · Computer Science 2016-01-26 Kyungjoo Kim , Sivasankaran Rajamanickam , George Stelle , H. Carter Edwards , Stephen L. Olivier

Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators

Tensor accelerators have gained popularity because they provide a cheap and efficient solution for speeding up computational-expensive tasks in Deep Learning and, more recently, in other Scientific Computing applications. However, since…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-15 Paolo Sylos Labini , Massimo Bernaschi , Francesco Silvestri , Flavio Vella

SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference

In recent years, there has been a flurry of research in deep neural network pruning and compression. Early approaches prune weights individually. However, it is difficult to take advantage of the resulting unstructured sparsity patterns on…

Machine Learning · Computer Science 2020-08-28 Ziheng Wang

Staging Blocked Evaluation over Structured Sparse Matrices

The matrices used in many computational settings are naturally sparse, holding a small percentage of nonzero elements. Storing such matrices in specialized sparse formats enables algorithms that avoid wasting computation on zeros,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-13 Pratyush Das , Amirhossein Basareh , Adhitha Dias , Artem Pelenitsyn , Kirshanthan Sundararajah , Milind Kulkarni , Ben Delaware

GPU-Accelerated Parallel Selected Inversion for Structured Matrices Using sTiles

Selected inversion is essential for applications such as Bayesian inference, electronic structure calculations, and inverse covariance estimation, where computing only specific elements of large sparse matrix inverses significantly reduces…

Performance · Computer Science 2025-09-03 Esmail Abdul Fattah , Hatem Ltaief , Havard Rue , David Keyes

Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

We discuss an approach for solving sparse or dense banded linear systems ${\bf A} {\bf x} = {\bf b}$ on a Graphics Processing Unit (GPU) card. The matrix ${\bf A} \in {\mathbb{R}}^{N \times N}$ is possibly nonsymmetric and moderately large;…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-09-29 Ang Li , Radu Serban , Dan Negrut

Dynamic Sparse Training of Diagonally Sparse Networks

Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate…

Machine Learning · Computer Science 2025-06-16 Abhishek Tyagi , Arjun Iyer , William H Renninger , Christopher Kanan , Yuhao Zhu

LDU factorization

LU-factorization of matrices is one of the fundamental algorithms of linear algebra. The widespread use of supercomputers with distributed memory requires a review of traditional algorithms, which were based on the common memory of a…

Symbolic Computation · Computer Science 2020-11-10 Gennadi Malaschonok

Efficient Sparse Training with Structured Dropout

Dropout is a common regularisation technique in deep learning that improves generalisation. Even though it introduces sparsity and thus potential for higher throughput, it usually cannot bring speed-ups on GPUs due to its unstructured…

Machine Learning · Computer Science 2024-11-05 Andy Lo

Efficient algorithms for computing rank-revealing factorizations on a GPU

Standard rank-revealing factorizations such as the singular value decomposition and column pivoted QR factorization are challenging to implement efficiently on a GPU. A major difficulty in this regard is the inability of standard algorithms…

Numerical Analysis · Mathematics 2023-05-23 Nathan Heavner , Chao Chen , Abinand Gopal , Per-Gunnar Martinsson

Parallel Coordinate Descent Methods for Big Data Optimization

In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex…

Optimization and Control · Mathematics 2013-11-27 Peter Richtárik , Martin Takáč