Related papers: High Performance Matrix Multiplication
Matrix multiplication is fundamental in the backpropagation algorithm used to train deep neural network models. Libraries like Intel's MKL or NVIDIA's cuBLAS implemented new and optimized matrix multiplication techniques that increase…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…
The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS…
KBLAS is a new open source high performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of…
Designing problems using matrices is very important in Computer Science. Fields like graph computer, graphs theory, and machine learning use matrices very often to solve their own problems. The most often matrix operation is the…
As users and developers, we are witnessing the opening of a new computing scenario: the introduction of hybrid processors into a single die, such as an accelerated processing unit (APU) processor, and the plug-and-play of additional…
Linear algebra routines are basic building blocks for the statistical software. In this paper we analyzed how can we can improve R performance for matrix computations. We benchmarked few matrix operations using the standard linear algebra…
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and…
General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which…
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in…
Matrix multiplication is the bedrock in Deep Learning inference application. When it comes to hardware acceleration on edge computing devices, matrix multiplication often takes up a great majority of the time. To achieve better performance…
Hierarchical matrices approximate a given matrix by a decomposition into low-rank submatrices that can be handled efficiently in factorized form. $\mathcal{H}^2$-matrices refine this representation following the ideas of fast multipole…
Although reliable long precision floating-point arithmetic libraries such as QD and MPFR/GMP are necessary to solve ill-conditioned problems in numerical simulation, long precision BLAS-level computation such as matrix multiplication has…
Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an…
The introduction of the Basic Linear Algebra Subroutine (BLAS) in the 1970s paved the way for different libraries to solve the same problem with an improved approach and hardware. The new BLAS implementation led to High-Performance…
Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With the increasing data volumes in areas like machine learning, the demand for efficient…
Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…
It is known that the multiplication of an $N \times M$ matrix with an $M \times P$ matrix can be performed using fewer multiplications than what the naive $NMP$ approach suggests. The most famous instance of this is Strassen's algorithm for…
Matrix multiplication (hereafter we use the acronym MM) is among the most fundamental operations of modern computations. The efficiency of its performance depends on various factors, in particular vectorization, data movement and arithmetic…
Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables high level of parallelism, where a row of one matrix is multiplied in…