Related papers: Communication-optimal Parallel and Sequential Chol…

Minimizing Communication in Linear Algebra

In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In…

Computational Complexity · Computer Science 2011-09-20 Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-26 Grzegorz Kwasniewski, Marko Kabić, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Jens Eirik Saethre, André Gaillard, Timo Schneider, Maciej Besta, Anton Kozhevnikov, Joost VandeVondele, Torsten Hoefler

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and…

Numerical Analysis · Mathematics 2010-11-16 Grey Ballard, James Demmel, Ioana Dumitriu

I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-22 Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou

Communication-optimal parallel and sequential QR and LU factorizations

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. We prove optimality by extending…

Numerical Analysis · Mathematics 2008-08-21 James Demmel, Laura Grigori, Mark Hoemmen, Julien Langou

Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK); ii) adding the result of the multiplication of a matrix with…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Vérité

Communication-avoiding Cholesky-QR2 for rectangular matrices

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-18 Edward Hutter, Edgar Solomonik

Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

A 3D Parallel Algorithm for QR Decomposition

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-15 Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight

Graph Expansion and Communication Costs of Fast Matrix Multiplication

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication…

Data Structures and Algorithms · Computer Science 2011-09-12 Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-14 Grzegorz Kwasniewski, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Timo Schneider, Maciej Besta, Torsten Hoefler

Worst-Case Optimal Algorithms for Parallel Query Processing

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the…

Databases · Computer Science 2016-04-08 Paul Beame, Paraschos Koutris, Dan Suciu

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-06-21 Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman

Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-03 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation

In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Vérité

Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product

The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-24 Grey Ballard, Nicholas Knight, Kathryn Rouse

Communication-optimal parallel and sequential QR and LU factorizations: theory and practice

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny…

Numerical Analysis · Computer Science 2008-08-29 James Demmel, Laura Grigori, Mark Hoemmen, Julien Langou

On the Tradeoff Between Computation and Communication Costs for Distributed Linearly Separable Computation

This paper studies the distributed linearly separable computation problem, which is a generalization of many existing distributed computing problems such as distributed gradient descent and distributed linear transform. In this problem, a…

Information Theory · Computer Science 2020-10-06 Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire

A New Model for Massively Parallel Computation Considering both Communication and IO Cost

In the research area of parallel computation, the communication cost has been extensively studied, while the IO cost has been neglected. For big data computation, the assumption that the data fits in main memory no longer holds, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-25 Hengzhao Ma, Xiangyu Gao, Jianzhong Li, Tianpeng Gao

Differentiation of the Cholesky decomposition

We review strategies for differentiating matrix-based computations, and derive symbolic and algorithmic update rules for differentiating expressions containing the Cholesky decomposition. We recommend new 'blocked' algorithms, based on…

Computation · Statistics 2016-02-25 Iain Murray