Related papers: Programming Parallel Dense Matrix Factorizations w…
Processors with large numbers of cores are becoming commonplace. In order to take advantage of the available resources in these systems, the programming paradigm has to move towards increased parallelism. However, increasing the level of…
Nonnegative matrix factorization (NMF) is a powerful technique for dimension reduction, extracting latent factors and learning part-based representation. For large datasets, NMF performance depends on some major issues: fast algorithms,…
Multicore has emerged as a typical architecture model since its advent and stands now as a standard. The trend is to increase the number of cores and improve the performance of the memory system. Providing an efficient multicore…
Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK)…
Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such…
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these…
Parallel processing is considered as todays and future trend for improving performance of computers. Computing devices ranging from small embedded systems to big clusters of computers rely on parallelizing applications to reduce execution…
Several methods exist today to accelerate Machine Learning(ML) or Deep-Learning(DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…
Regions of nested loops are a common feature of High Performance Computing (HPC) codes. In shared memory programming models, such as OpenMP, these structure are the most common source of parallelism. Parallelising these structures requires…
The parallel and distributed processing are becoming de facto industry standard, and a large part of the current research is targeted on how to make computing scalable and distributed, dynamically, without allocating the resources on…
This paper presents our work toward correct and efficient automatic differentiation of OpenMP parallel worksharing loops in forward and reverse mode. Automatic differentiation is a method to obtain gradients of numerical programs, which are…
Given a sparse matrix $A$, the selected inversion algorithm is an efficient method for computing certain selected elements of $A^{-1}$. These selected elements correspond to all or some nonzero elements of the LU factors of $A$. In many…
Matrix factorization (MF) is employed by many popular algorithms, e.g., collaborative filtering. The emerging GPU technology, with massively multicore and high intra-chip memory bandwidth but limited memory capacity, presents an opportunity…
There is an ever-present need for shared memory parallelization schemes to exploit the full potential of multi-core architectures. The most common parallelization API addressing this need today is OpenMP. Nevertheless, writing parallel code…
The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…
Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate…
This paper advocates for an intertwined design of the dense linear algebra software stack that breaks down the strict barriers between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level,…
Multiresolution Matrix Factorization (MMF) was recently introduced as a method for finding multiscale structure and defining wavelets on graphs/matrices. In this paper we derive pMMF, a parallel algorithm for computing the MMF…
Direct Multisearch (DMS) is a Derivative-free Optimization class of algorithms suited for computing approximations to the complete Pareto front of a given Multiobjective Optimization problem. It has a well-supported convergence analysis and…
We address the parallelization of the LU factorization of hierarchical matrices ($\mathcal{H}$-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which…