Related papers: Parallelizing the Approximate Minimum Degree Order…
In this paper, we present several improvements in the parallelization of the in-place merge algorithm, which merges two contiguous sorted arrays into one with an O(T) space complexity (where T is the number of threads). The approach divides…
The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…
Approximation via sampling is a widespread technique whenever exact solutions are too expensive. In this paper, we present techniques for an efficient parallelization of adaptive (a. k. a. progressive) sampling algorithms on multi-threaded…
This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly…
Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query…
We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping.…
This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, which is frequently used in sketching algorithms. We develop a novel scheme that takes advantage of blocking and recomputation…
As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it…
In this paper, we present a concurrent implementation of a powerful topological thinning operator. This operator is able to act directly over grayscale images without modifying their topology. We introduce an adapted parallelization…
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an…
This thesis develops signal-processing algorithms and implementation schemes under constraints of minimal parallelism and memory space, with the goal of improving energy efficiency of low-power computing hardware. We propose (i) a…
Optimal multiple sequence alignment by dynamic programming, like many highly dimensional scientific computing problems, has failed to benefit from the improvements in computing performance brought about by multi-processor systems, due to…
This paper proposes a parallel numerical algorithm to simulate the flow and the transport in a discrete fracture network taking into account the mass exchanges with the surrounding matrix. The discretization of the Darcy fluxes is based on…
We present a shared memory implementation of a parallel algorithm, called delta-stepping, for solving the single source shortest path problem for directed and undirected graphs. In order to reduce synchronization costs we make some…
This paper presents the design and analysis of parallel approximation algorithms for facility-location problems, including $\NC$ and $\RNC$ algorithms for (metric) facility location, $k$-center, $k$-median, and $k$-means. These problems…
The minimum degree algorithm is one of the most widely-used heuristics for reducing the cost of solving large sparse systems of linear equations. It has been studied for nearly half a century and has a rich history of bridging techniques…
Nonnegative matrix factorization (NMF) is a powerful technique for dimension reduction, extracting latent factors and learning part-based representation. For large datasets, NMF performance depends on some major issues: fast algorithms,…
This paper studies parallelization schemes for stochastic Vector Quantization algorithms in order to obtain time speed-ups using distributed resources. We show that the most intuitive parallelization scheme does not lead to better…
In this paper, we consider an approach to the parallelizing of the algorithms realizing the modified probability changigng method with adaptation and partial rollback procedure for constrained pseudo-Boolean optimization problems. Existing…