Related papers: Communication-optimal Parallel and Sequential Chol…
In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In…
Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for…
Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and…
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we…
We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. We prove optimality by extending…
In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK) ii) adding the result of the multiplication of a matrix with…
Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show…
Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large…
Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing…
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication…
Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad…
In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the…
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can…
Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower…
In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a…
The matricized-tensor times Khatri-Rao product computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish…
We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny…
This paper studies the distributed linearly separable computation problem, which is a generalization of many existing distributed computing problems such as distributed gradient descent and distributed linear transform. In this problem, a…
In the research area of parallel computation, the communication cost has been extensively studied, while the IO cost has been neglected. For big data computation, the assumption that the data fits in main memory no longer holds, and…
We review strategies for differentiating matrix-based computations, and derive symbolic and algorithmic update rules for differentiating expressions containing the Cholesky decomposition. We recommend new `blocked' algorithms, based on…