Related papers: Minimizing Communication in Linear Algebra

Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-24 Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-26 Grzegorz Kwasniewski, Marko Kabić, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Jens Eirik Saethre, André Gaillard, Timo Schneider, Maciej Besta, Anton Kozhevnikov, Joost VandeVondele, Torsten Hoefler

Communication-optimal Parallel and Sequential Cholesky Decomposition

Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case).…

Numerical Analysis · Computer Science 2011-02-02 Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Communication-Optimal Tilings for Projective Nested Loops with Arbitrary Bounds

Reducing communication - either between levels of a memory hierarchy or between processors over a network - is a key component of performance optimization (in both time and energy) for many problems, including dense linear algebra, particle…

Data Structures and Algorithms · Computer Science 2020-03-03 Grace Dinh, James Demmel

Communication-optimal parallel and sequential QR and LU factorizations

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. We prove optimality by extending…

Numerical Analysis · Mathematics 2008-08-21 James Demmel, Laura Grigori, Mark Hoemmen, Julien Langou

Graph Expansion and Communication Costs of Fast Matrix Multiplication

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication…

Data Structures and Algorithms · Computer Science 2011-09-12 Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Communication Lower Bounds and Optimal Algorithms for Multiple Tensor-Times-Matrix Computation

Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-03 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-27 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and…

Numerical Analysis · Mathematics 2010-11-16 Grey Ballard, James Demmel, Ioana Dumitriu

Communication Lower-Bounds for Distributed-Memory Computations for Mass Spectrometry based Omics Data

Mass spectrometry (MS) based omics data analysis requires significant time and resources. To date, few parallel algorithms have been proposed for deducing peptides from mass spectrometry-based data. However, these parallel algorithms were…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-12 Fahad Saeed, Muhammad Haseeb, SS Iyengar

Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product

The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-24 Grey Ballard, Nicholas Knight, Kathryn Rouse

The Input/Output Complexity of Sparse Matrix Multiplication

We consider the problem of multiplying sparse matrices (over a semiring) where the number of non-zero entries is larger than main memory. In the classical paper of Hong and Kung (STOC '81) it was shown that to compute a product of dense $U…

Data Structures and Algorithms · Computer Science 2014-03-17 Rasmus Pagh, Morten Stöckel

Communication Lower Bounds for Distributed-Memory Computations

We give lower bounds on the communication complexity required to solve several computational problems in a distributed-memory parallel machine, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast…

Data Structures and Algorithms · Computer Science 2013-09-24 Michele Scquizzato, Francesco Silvestri

Communication lower bounds and optimal algorithms for programs that reference arrays -- Part 1

The movement of data (communication) between levels of a memory hierarchy, or between parallel processors on a network, can greatly dominate the cost of computation, so algorithms that minimize communication are of interest. Motivated by…

Classical Analysis and ODEs · Mathematics 2013-08-03 Michael Christ, James Demmel, Nicholas Knight, Thomas Scanlon, Katherine Yelick

Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK) ii) adding the result of the multiplication of a matrix with…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Verite

I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-22 Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou

A Tight I/O Lower Bound for Matrix Multiplication

A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed…

Computational Complexity · Computer Science 2019-02-07 Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-14 Grzegorz Kwasniewski, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Timo Schneider, Maciej Besta, Torsten Hoefler

Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-21 Lukas Gianinazzi, Alexandros Nikolaos Ziogas, Langwen Huang, Piotr Luczynski, Saleh Ashkboos, Florian Scheidl, Armon Carigiet, Chio Ge, Nabil Abubaker, Maciej Besta, Tal Ben-Nun, Torsten Hoefler

Low-Bandwidth Matrix Multiplication: Faster Algorithms and More General Forms of Sparsity

In prior work, Gupta et al. (SPAA 2022) presented a distributed algorithm for multiplying sparse $n \times n$ matrices, using $n$ computers. They assumed that the input matrices are uniformly sparse--there are at most $d$ non-zeros in each…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-24 Chetan Gupta, Janne H. Korhonen, Jan Studený, Jukka Suomela, Hossein Vahidi