Related papers: Giga-scale Kernel Matrix Vector Multiplication on …

The Fast Kernel Transform

Kernel methods are a highly effective and widely used collection of modern machine learning algorithms. A fundamental limitation of virtually all such methods is the computations involving the kernel matrix that naively scale quadratically…

Machine Learning · Computer Science 2021-06-09 John Paul Ryan, Sebastian Ament, Carla P. Gomes, Anil Damle

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM,…

Performance · Computer Science 2025-11-25 Alfredo Metere

Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units

The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in condensed matter physics and chemistry. The algorithm is difficult to parallelize on a…

Computational Physics · Physics 2011-05-30 Shixun Zhang, Shinichi Yamagiwa, Masahiko Okumura, Seiji Yunoki

PBBFMM3D: a parallel black-box algorithm for kernel matrix-vector multiplication

Kernel matrix-vector product is ubiquitous in many science and engineering applications. However, a naive method requires $O(N^2)$ operations, which becomes prohibitive for large-scale problems. We introduce a parallel method that provably…

Mathematical Software · Computer Science 2021-04-30 Ruoxi Wang, Chao Chen, Jonghyun Lee, Eric Darve

Precision-Energy-Throughput Scaling Of Generic Matrix Multiplication and Convolution Kernels Via Linear Projections

Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose…

Multimedia · Computer Science 2014-11-12 Mohammad Ashraful Anam, Paul N. Whatmough, Yiannis Andreopoulos

Fast Gaussian process inference by exact Matérn kernel decomposition

To speed up Gaussian process inference, a number of fast kernel matrix-vector multiplication (MVM) approximation algorithms have been proposed over the years. In this paper, we establish an exact fast kernel MVM algorithm based on exact…

Machine Learning · Statistics 2025-08-05 Nicolas Langrené, Xavier Warin, Pierre Gruet

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel non-parametric, self-tunable,…

Numerical Analysis · Computer Science 2012-12-24 Xintian Yang, Srinivasan Parthasarathy, Ponnuswamy Sadayappan

A SVD accelerated kernel-independent fast multipole method and its application to BEM

The kernel-independent fast multipole method (KIFMM) proposed in [1] is of almost linear complexity. In the original KIFMM the time-consuming M2L translations are accelerated by FFT. However, when more equivalent points are used to achieve…

Numerical Analysis · Computer Science 2015-03-19 Yanchuang Cao, Lihua Wen, Junjie Rong

Kernel methods through the roof: handling billions of points efficiently

Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits…

Machine Learning · Computer Science 2020-11-30 Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

An Ultra-high-speed Reproducing Kernel Particle Method

In this work, the fast-convolving reproducing kernel particle method (FC-RKPM) is introduced. This method is hundreds to millions of times faster than the traditional RKPM for 3D meshfree simulations. In this approach, the meshfree…

Numerical Analysis · Mathematics 2024-04-01 Siavash Jafarzadeh, Michael Hillman

Generating Families of Practical Fast Matrix Multiplication Algorithms

Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due…

Mathematical Software · Computer Science 2016-11-04 Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn

Fast Evaluation of Additive Kernels: Feature Arrangement, Fourier Methods, and Kernel Derivatives

One of the main computational bottlenecks when working with kernel based learning is dealing with the large and typically dense kernel matrix. Techniques dealing with fast approximations of the matrix vector product for these kernel…

Machine Learning · Computer Science 2024-04-29 Theresa Wagner, Franziska Nestler, Martin Stoll

Compiler-Level Matrix Multiplication Optimization for Deep Learning

An important linear algebra routine, GEneral Matrix Multiplication (GEMM), is a fundamental operator in deep learning. Compilers need to translate these routines into low-level code optimized for specific hardware. Compiler-level…

Machine Learning · Computer Science 2019-09-25 Huaqing Zhang, Xiaolin Cheng, Hui Zang, Dae Hoon Park

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-03 Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution

We present a novel approach for accelerating convolutions during inference for CPU-based architectures. The most common method of computation involves packing the image into the columns of a matrix (im2col) and performing general matrix…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Amir Ofir, Gil Ben-Artzi

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our previous recent work showed scaling of an FMM on GPU clusters, with problem sizes in the…

Numerical Analysis · Computer Science 2012-10-30 Rio Yokota, Lorena Barba

Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters

General matrix multiplication (GEMM) operations are the fundamental building blocks of computational domains including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-26 Harisankar Sadasivan, Muhammed Emin Ozturk, Muhammad Osama, Chris Millette, Astha Rai, Maksim Podkorytov, John Afaganis, Carlus Huang, Jing Zhang, Jun Liu

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Memory and Computation-Efficient Kernel SVM via Binary Embedding and Ternary Model Coefficients

Kernel approximation is widely used to scale up kernel SVM training and prediction. However, the memory and computation costs of kernel approximation models are still too high if we want to deploy them on memory-limited devices such as…

Machine Learning · Computer Science 2020-10-07 Zijian Lei, Liang Lan

Accelerating Graph Neural Networks with a Novel Matrix Compression Format

The inference and training stages of Graph Neural Networks (GNNs) are often dominated by the time required to compute a long sequence of matrix multiplications between the sparse graph adjacency matrix and its embedding. To accelerate these…

Data Structures and Algorithms · Computer Science 2024-09-05 João N. F. Alves, Samir Moustafa, Siegfried Benkner, Alexandre P. Francisco, Wilfried N. Gansterer, Luís M. S. Russo