Implementing Strassen's Algorithm with BLIS

Mathematical Software 2016-05-05 v1

Abstract

We dispel the "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond the buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel(R) Xeon Phi(TM) coprocessor utilizing 240 threads. We also show how a distributed-memory matrix-matrix multiplication benefits from these advances.

Cite

@article{arxiv.1605.01078,
  title  = {Implementing Strassen's Algorithm with BLIS},
  author = {Jianyu Huang and Tyler M. Smith and Greg M. Henry and Robert A. van de Geijn},
  journal= {arXiv preprint arXiv:1605.01078},
  year   = {2016}
}