Related papers: A communication-avoiding parallel algorithm for th…

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and…

Numerical Analysis · Mathematics 2010-11-16 Grey Ballard, James Demmel, Ioana Dumitriu

Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

A 3D Parallel Algorithm for QR Decomposition

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-15 Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight

Minimizing the Arithmetic and Communication Complexity of Jacobi's Method for Eigenvalues and Singular Values: Part One -- Serial Algorithms

We analyze several versions of Jacobi's method for the symmetric eigenvalue problem. Our goal is to reduce the asymptotic cost of the algorithm as much as possible, as measured by the number of arithmetic operations performed and associated…

Numerical Analysis · Mathematics 2026-04-21 James Demmel, Hengrui Luo, Ryan Schneider, Yifu Wang

A Communication Avoiding and Reducing Algorithm for Symmetric Eigenproblem for Very Small Matrices

In this paper, a parallel symmetric eigensolver with very small matrices in massively parallel processing is considered. We define very small matrices that fit the sizes of caches per node in a supercomputer. We assume that the sizes also…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-02 Takahiro Katagiri, Jun'ichi Iwata, Kazuyuki Uchida

Implementing Communication-Optimal Parallel and Sequential QR Factorizations

We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication, and are as stable as Householder QR. The sequential and parallel algorithms…

Numerical Analysis · Mathematics 2008-09-16 James Demmel, Laura Grigori, Mark Hoemmen, Julien Langou

Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation

In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Vérité

A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation

The central importance of large scale eigenvalue problems in scientific computation necessitates the development of massively parallel algorithms for their solution. Recent advances in dense numerical linear algebra have enabled the routine…

Numerical Analysis · Mathematics 2020-05-08 David B. Williams-Young, Paul G. Beckman, Chao Yang

An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

We present a novel, highly efficient algorithm to parallelize O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in…

Astrophysics · Physics 2009-11-07 Junichiro Makino

BSP Sorting: An experimental Study

The Bulk-Synchronous Parallel model of computation has been used for the architecture independent design and analysis of parallel algorithms whose performance is expressed not only in terms of problem size n but also in terms of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-29 Alexandros V. Gerbessiotis, Constantinos J. Siniolakis

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes…

Data Structures and Algorithms · Computer Science 2012-02-16 Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs

We present a space and time efficient practical parallel algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-10 Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Eli Upfal

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho, Felipe O. Lins e Silva, Daniel Aloise, Samuel Xavier-de-Souza

Shared Memory Parallelization of MTTKRP for Dense Tensors

The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-31 Koby Hayashi, Grey Ballard, Jeffrey Jiang, Michael Tobia

A distributed-memory hierarchical solver for general sparse linear systems

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it…

Numerical Analysis · Mathematics 2017-12-21 Chao Chen, Hadi Pouransari, Sivasankaran Rajamanickam, Erik G. Boman, Eric Darve

Parallel Evolutionary Computation in Very Large Scale Eigenvalue Problems

The history of research on eigenvalue problems is rich with many outstanding contributions. Nonetheless, the rapidly increasing size of data sets requires new algorithms for old problems in the context of extremely large matrix dimensions.…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-12-17 Hesam T. Dashti, Alireza F. Siahpirani, Liya Wang, Mary Kloc, Amir H. Assadi

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

For parallel breadth first search (BFS) algorithm on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper we sufficiently reduce the…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-08-29 Huiwei Lv, Guangming Tan, Mingyu Chen, Ninghui Sun

Parallel Nonnegative CP Decomposition of Dense Tensors

The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can…

Numerical Analysis · Computer Science 2018-06-22 Grey Ballard, Koby Hayashi, Ramakrishnan Kannan

Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and $k$-Bisimulation for Long $k$-Chaining

We developed a flexible parallel algorithm for graph summarization based on vertex-centric programming and parameterized message passing. The base algorithm supports infinitely many structural graph summary models defined in a formal…

Data Structures and Algorithms · Computer Science 2022-11-07 Till Blume, Jannik Rau, David Richerby, Ansgar Scherp

Category Theory for Supercomputing: The Tensor Product of Linear BSP Algorithms

We show that a particular class of parallel algorithm for linear functions can be straightforwardly generalized to a parallel algorithm of their tensor product. The central idea is to take a model of parallel algorithms -- Bulk Synchronous…

Category Theory · Mathematics 2025-10-02 Thomas Koopman, Rob H. Bisseling, Sven-Bodo Scholz