Jack Dongarra
Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), pp. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer…
We present a mixed-precision benchmark called HPL-MxP that combines a lower-precision LU factorization with non-stationary iterative refinement based on GMRES. We evaluate the numerical stability of one of the methods of generating the…
The block elimination with additive modifications (BEAM) method was recently proposed as an alternative to LU with partial pivoting requiring less communication. Because of the novelty of BEAM, the existing theoretical analysis is lacking.…
The evolution of floating-point computation has been shaped by algorithmic advancements, architectural innovations, and the increasing computational demands of modern technologies, such as artificial intelligence (AI) and high-performance…
Parker and Lê introduced random butterfly transforms (RBTs) as a preprocessing technique to replace pivoting in dense LU factorization. Unfortunately, their FFT-like recursive structure restricts the dimensions of the matrix. Furthermore,…
HPC and Cloud have evolved independently, specializing their innovations into performance or productivity. Acceleration as a Service (XaaS) is a recipe to empower both fields with a shared execution platform that provides transparent access…
Randomized numerical linear algebra - RandNLA, for short - concerns the use of randomization as a resource to develop improved algorithms for large-scale linear algebra computations. The origins of contemporary RandNLA lay in theoretical…
Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(-1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more…
The world of computing is in rapid transition, now dominated by a world of smartphones and cloud services, with profound implications for the future of advanced scientific computing. Simply put, high-performance computing (HPC) is at an…
Efficient exploitation of exascale architectures requires rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra fine-grain parallelism and maximize the ratio of…
This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to…
The GMRES method is used to solve sparse, non-symmetric systems of linear equations arising from many scientific applications. The solver performance within a single node is memory bound, due to the low arithmetic intensity of its…
In recent years, hardware vendors have started designing low-precision special function units in response to the machine learning community's demand for high compute power in low-precision formats. Also the…
We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization…
Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific…
It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is…
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and…
This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon…
This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We…
This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of…
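
Several entries above (the HPL-MxP benchmark and the work on combining 32-bit and 64-bit arithmetic) rest on the same idea: factor the matrix in low precision, then recover accuracy with refinement steps in higher precision. The sketch below is only an illustration of that idea and is not taken from any of the listed papers. It uses classical (stationary) iterative refinement rather than the GMRES-based variant, emulates binary32 by rounding through `struct`, and all function names (`f32`, `lu_f32`, `lu_solve`, `refine`) are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_f32(A):
    """LU factorization with partial pivoting; every stored entry is rounded to binary32."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Choose the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm


def lu_solve(LU, perm, b):
    """Forward and back substitution in binary64 using the binary32 factors."""
    n = len(b)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def refine(A, b, iters=5):
    """Classical iterative refinement: low-precision factors, binary64 residuals."""
    LU, perm = lu_f32(A)
    x = lu_solve(LU, perm, b)
    for _ in range(iters):
        # Residual r = b - A x, computed in binary64 with the original matrix.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = lu_solve(LU, perm, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
    b = [12.0, 7.0, 17.0]  # chosen so that the exact solution is [1, 2, 3]
    print(refine(A, b))
```

The expensive O(n^3) factorization is done once in the cheaper format, while each refinement step costs only O(n^2); for well-conditioned matrices a few steps are typically enough to approach binary64 accuracy.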