Related papers: Cache-oblivious Matrix Multiplication for Exact Fa…
We examine several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of \cite{Dumas} requires index conversion…
Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality either in time or space, but not both, which promotes lots of research on the best possible balance or tradeoff of such algorithms. We study modern…
We investigate effects of ordering in blocked matrix--matrix multiplication. We find that submatrices do not have to be stored contiguously in memory to achieve near optimal performance. Instead it is the choice of execution order of the…
Matrix multiplication (MatMul) is the computational backbone of modern machine learning, yet its classical complexity remains a bottleneck for large-scale data processing. We propose a hybrid quantum-classical algorithm for matrix…
This paper studies two variants of tiling: iteration space tiling (or loop blocking) and cache-oblivious methods that recursively split the iteration space with divide-and-conquer. The key question to answer is when we should be using one…
This paper presents an efficient technique for matrix-vector and vector-transpose-matrix multiplication in distributed-memory parallel computing environments, where the matrices are unstructured, sparse, and have a substantially larger…
We study matrix-matrix multiplication of two matrices, $A$ and $B$, each of size $n \times n$. This operation results in a matrix $C$ of size $n\times n$. Our goal is to produce $C$ as efficiently as possible given a cache: a 1-D limited…
Hierarchical matrices approximate a given matrix by a decomposition into low-rank submatrices that can be handled efficiently in factorized form. $\mathcal{H}^2$-matrices refine this representation following the ideas of fast multipole…
Frigo et al. proposed an ideal cache model and a recursive technique to design sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to…
The multiplication of a matrix by its transpose, $A^T A$, appears as an intermediate operation in the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm (ATA) for computing this product, based upon…
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and…
We consider the related tasks of matrix completion and matrix approximation from missing data and propose adaptive sampling procedures for both problems. We show that adaptive sampling allows one to eliminate standard incoherence…
Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by…
In this paper we present randomized algorithms for sorting and convex hull that achieves optimal performance (for speed-up and cache misses) on the multicore model with private cache model. Our algorithms are cache oblivious and generalize…
Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU…
We present new algorithms to detect and correct errors in the product of two matrices, or the inverse of a matrix, over an arbitrary field. Our algorithms do not require any additional information or encoding other than the original inputs…
Renewed interest in mixed-precision algorithms has emerged due to growing data capacity and bandwidth concerns, as well as the advancement of GPUs, which enable significant speedup for low precision arithmetic. In light of this, we propose…
We present two cache-oblivious sorting-based convex hull algorithms in the Binary Forking Model. The first is an algorithm for a presorted set of points which achieves $O(n)$ work, $O(\log n)$ span, and $O(n/B)$ serial cache complexity,…
We study the problem of exact completion for $m \times n$ sized matrix of rank $r$ with the adaptive sampling method. We introduce a relation of the exact completion problem with the sparsest vector of column and row spaces (which we call…
In this paper, we introduce novel fast matrix inversion algorithms that leverage triangular decomposition and recurrent formalism, incorporating Strassen's fast matrix multiplication. Our research places particular emphasis on triangular…