Related papers: Exploiting nested task-parallelism in the $\mathca…

$O(N)$ distributed direct factorization of structured dense matrices using runtime systems

Structured dense matrices result from boundary integral problems in electrostatics and geostatistics, and also Schur complements in sparse preconditioners such as multi-frontal methods. Exploiting the structure of such matrices can reduce…

Numerical Analysis · Mathematics 2023-11-03 Sameer Deshmukh, Qianxiang Ma, Rio Yokota, George Bosilca

Worksharing Tasks: An Efficient Way to Exploit Irregular and Fine-Grained Loop Parallelism

Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism, while the latter relies on fine-grained synchronization among…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-08 M. Maronas, K. Sala, S. Mateo, E. Ayguadé, V. Beltran (Barcelona Supercomputing Center)

Programming Parallel Dense Matrix Factorizations with Look-Ahead and OpenMP

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multithreaded version of BLAS. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-20 Sandra Catalán, Adrián Castelló, Francisco D. Igual, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

Handling Nested Parallelism and Extreme Load Imbalance in an Orbital Analysis Code

Nested parallelism exists in scientific codes that are searching multi-dimensional spaces. However, implementations of nested parallelism often have overhead and load balance issues. The Orbital Analysis code we present exhibits a sparse…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-01 Benjamin James Gaska, Neha Jothi, Mahdi Soltan Mohammadi, Kat Volk, Michelle Mills Strout

Advanced Synchronization Techniques for Task-based Runtime Systems

Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-18 David Álvarez, Kevin Sala, Marcos Maroñas, Aleix Roca, Vicenç Beltran

An inherently parallel H2-ULV factorization for solving dense linear systems on GPUs

Hierarchical low-rank approximation of dense matrices can reduce the complexity of their factorization from $O(N^3)$ to $O(N)$. However, the complex structure of such hierarchical matrices makes them difficult to parallelize. The block size and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-05 Qianxiang Ma, Rio Yokota

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

Several methods exist today to accelerate Machine Learning (ML) or Deep-Learning (DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…

Machine Learning · Computer Science 2023-08-23 Srinjoy Das, Lawrence Rauchwerger

A Task-Parallel Approach for Localized Topological Data Structures

Unstructured meshes are characterized by data points irregularly distributed in the Euclidean space. Due to the irregular nature of these data, computing connectivity information between the mesh elements requires much more time and memory…

Data Structures and Algorithms · Computer Science 2025-04-03 Guoxi Liu, Federico Iuricich

HDOT -- an Approach Towards Productive Programming of Hybrid Applications

MPI applications matter. However, with the advent of many-core processors, traditional MPI applications are challenged to achieve satisfactory performance. This is due to the inability of these applications to respond to load imbalances, to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-20 Jan Ciesko, Pedro J. Martínez-Ferrer, Raúl Peñacoba Veigas, Xavier Teruel, Vicenç Beltran

Randomized Strong Recursive Skeletonization: Simultaneous Compression and LU Factorization of Hierarchical Matrices using Matrix-Vector Products

The hierarchical matrix framework partitions matrices into subblocks that are either small or of low numerical rank, enabling linear storage complexity and efficient matrix-vector multiplication. This work focuses on the $H^2$-matrix format…

Numerical Analysis · Mathematics 2026-02-02 Anna Yesypenko, Per-Gunnar Martinsson

OMP-Engineer: Bridging Syntax Analysis and In-Context Learning for Efficient Automated OpenMP Parallelization

In advancing parallel programming, particularly with OpenMP, the shift towards NLP-based methods marks a significant innovation beyond traditional S2S tools like Autopar and Cetus. These NLP approaches train on extensive datasets of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-07 Weidong Wang, Haoran Zhu

Scalable Linear Time Dense Direct Solver for 3-D Problems Without Trailing Sub-Matrix Dependencies

Factorization of large dense matrices is ubiquitous in engineering and data science applications, e.g. preconditioners for iterative boundary integral solvers, frontal matrices in sparse multifrontal solvers, and computing the determinant…

Numerical Analysis · Mathematics 2022-08-24 Qianxiang Ma, Sameer Deshmukh, Rio Yokota

Scheduling Distributed Clusters of Parallel Machines: Primal-Dual and LP-based Approximation Algorithms [Full Version]

The Map-Reduce computing framework rose to prominence with datasets of such size that dozens of machines on a single cluster were needed for individual jobs. As datasets approach the exabyte scale, a single job may need distributed…

Data Structures and Algorithms · Computer Science 2016-10-31 Riley Murray, Samir Khuller, Megan Chao

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these…

Mathematical Software · Computer Science 2008-06-12 Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra

Compressing rank-structured matrices via randomized sampling

Randomized sampling has recently been proven a highly efficient technique for computing approximate factorizations of matrices that have low numerical rank. This paper describes an extension of such techniques to a wider class of matrices…

Numerical Analysis · Mathematics 2015-03-25 Per-Gunnar Martinsson

The OpenMP Cluster Programming Model

Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-16 Hervé Yviquel, Marcio Pereira, Emílio Francesquini, Guilherme Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, Sandro Rigo, Alan Souza, Guido Araujo

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which…

Mathematical Software · Computer Science 2015-02-27 Pieter Ghysels, Xiaoye S. Li, Francois-Henry Rouet, Samuel Williams, Artem Napov

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block…

Mathematical Software · Computer Science 2016-01-26 Kyungjoo Kim, Sivasankaran Rajamanickam, George Stelle, H. Carter Edwards, Stephen L. Olivier

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization with Partial Pivoting

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-22 Sandra Catalán, José R. Herrero, Enrique S. Quintana-Ortí, Rafael Rodríguez-Sánchez, Robert van de Geijn

Avoiding Serialization Effects in Data-Dependency aware Task Parallel Algorithms for Spatial Decomposition

Spatial decomposition is a popular basis for parallelising code. Cast in the frame of task parallelism, calculations on a spatial domain can be treated as a task. If neighbouring domains interact and share results, access to the specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-01-20 Christoph Niethammer, Colin W. Glass, Jose Gracia