Efficient Quantized Sparse Matrix Operations on Tensor Cores

Shigang Li; Kazuki Osawa; Torsten Hoefler

doi:10.1109/SC41404.2022.00042

Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning. Version: v4 (2023-05-09)

Abstract

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
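For readers unfamiliar with the two operations named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is plain Python over a CSR layout, written only to show the semantics; it is not Magicube's API, data layout, or Tensor-core implementation. The function names `spmm` and `sddmm`, the CSR argument names, and the sample matrices are all assumptions made for illustration.

```python
# Reference semantics only: plain Python, no Tensor cores, not Magicube's API.
# All names and sample data below are hypothetical.

def spmm(indptr, indices, values, dense, n_cols):
    """SpMM: sparse A (CSR) times dense B, so C[i][j] = sum_k A[i][k] * B[k][j]."""
    n_rows = len(indptr) - 1
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i.
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], values[p]
            for j in range(n_cols):
                out[i][j] += a * dense[k][j]
    return out


def sddmm(indptr, indices, lhs, rhs_t):
    """SDDMM: evaluate lhs @ transpose(rhs_t) only at the nonzero positions of a CSR mask.

    Returns the output values in the same order as `indices`.
    """
    n_rows = len(indptr) - 1
    out_vals = []
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            # Dot product of row i of lhs with row j of rhs_t.
            out_vals.append(sum(x * y for x, y in zip(lhs[i], rhs_t[j])))
    return out_vals


# Sparse 2x3 matrix A = [[2, 0, -1], [0, 3, 0]] in CSR, with small integer entries.
indptr, indices, values = [0, 2, 3], [0, 2, 1], [2, -1, 3]

# SpMM: A (2x3, sparse) times B (3x2, dense).
B = [[1, 2], [3, 4], [5, 6]]
C = spmm(indptr, indices, values, B, n_cols=2)

# SDDMM: reuse A's sparsity pattern as the mask; lhs is 2x2, rhs_t is 3x2.
lhs = [[1, 2], [3, 4]]
rhs_t = [[1, 0], [0, 1], [1, 1]]
S = sddmm(indptr, indices, lhs, rhs_t)

print(C)  # [[-3, -2], [9, 12]]
print(S)  # [1, 3, 4]
```

In both functions the work is proportional to the number of stored nonzeros rather than to the full matrix size, which is the property that sparse kernels exploit.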

Keywords

sparse matrix multiplication; GPU computing; mixed precision training

Cite

@article{arxiv.2209.06979,
  title  = {Efficient Quantized Sparse Matrix Operations on Tensor Cores},
  author = {Shigang Li and Kazuki Osawa and Torsten Hoefler},
  journal= {arXiv preprint arXiv:2209.06979},
  year   = {2023}
}

Comments

Published in Proceedings of the 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22), No. 37, Pages 1-15, Best Paper Finalist, https://dl.acm.org/doi/10.5555/3571885.3571934. In this arXiv version, we fix a typo at the bottom right of Page 6 (for SDDMM, each thread block needs $\textbf{K/BS}_k$ steps to obtain the final results), and we fix Table 3.
