Related papers: Seer: Predictive Runtime Kernel Selection for Irre…

Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu

Adaptive SpMV/SpMSpV on GPUs for Input Vectors of Varied Sparsity

Despite numerous efforts for optimizing the performance of Sparse Matrix and Vector Multiplication (SpMV) on modern hardware architectures, few works are done to its sparse counterpart, Sparse Matrix and Sparse Vector Multiplication…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-18 Min Li, Yulong Ao, Chao Yang

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical…

Machine Learning · Computer Science 2024-04-18 Khawir Mahmood, Jehandad Khan, Hammad Afzal

SEEK: Self-adaptive Explainable Kernel For Nonstationary Gaussian Processes

Gaussian processes (GPs) are powerful probabilistic models that define flexible priors over functions, offering strong interpretability and uncertainty quantification. However, GP models often rely on simple, stationary kernels which can…

Machine Learning · Computer Science 2025-05-20 Nima Negarandeh, Carlos Mora, Ramin Bostanabad

Random Binary Mappings for Kernel Learning and Efficient SVM

Support Vector Machines (SVMs) are powerful learners that have led to state-of-the-art results in various computer vision problems. SVMs suffer from various drawbacks in terms of selecting the right kernel, which depends on the image…

Computer Vision and Pattern Recognition · Computer Science 2014-03-31 Gemma Roig, Xavier Boix, Luc Van Gool

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri, Apan Qasem, Martin Burtscher

A lightweight optimization selection method for Sparse Matrix-Vector Multiplication

In this paper, we propose an optimization selection methodology for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. We propose two models that attempt to identify the major performance bottleneck of the kernel for every…

Performance · Computer Science 2016-01-12 Athena Elafrou, Georgios Goumas, Nectarios Koziris

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel non-parametric, self-tunable,…

Numerical Analysis · Computer Science 2012-12-24 Xintian Yang, Srinivasan Parthasarathy, Ponnuswamy Sadayappan

GPU Load Balancing

Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on the GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-20 Muhammad Osama

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU

Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost in many scientific applications. Due to providing massive parallelism and high memory bandwidth, GPUs are commonly used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-14 Mina Ashoury, Mohammad Loni, Farshad Khunjush, Masoud Daneshtalab

Kernel methods through the roof: handling billions of points efficiently

Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits…

Machine Learning · Computer Science 2020-11-30 Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…

Performance · Computer Science 2020-06-22 James D. Stevens, Andreas Klöckner

MERBIT: A GPU-Based SpMV Method for Iterative Workloads

Sparse Matrix-Vector Multiplication (SpMV) is the cornerstone in many iterative workloads, including large-scale graph analytics and sparse iterative solvers. Accelerating SpMV on real-world graphs remains challenging due to highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Qi Zhang, Zhengan Yao, Zhenglu Jiang, Zan-Bo Zhang

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental computation in graph analytics, scientific simulation, and sparse deep learning workloads. However, the extreme irregularity of real-world sparse matrices prevents existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Aiying Li, Jingwei Sun, Han Li, Wence Ji, Guangzhong Sun

Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution

Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Despite the potential concurrency between independent kernels, the order in which kernels are issued to the GPU will…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-26 Teng Li, Vikram K. Narayana, Tarek El-Ghazawi

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-22 Yifan Li, Giulia Guidi

Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data

Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A…

Machine Learning · Computer Science 2022-06-23 Tung Doan, Atsuhiro Takasu

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

Recently, graphics processors (GPUs) have been increasingly leveraged in a variety of scientific computing applications. However, architectural differences between CPUs and GPUs necessitate the development of algorithms that take advantage…

Mathematical Software · Computer Science 2015-01-05 Jonathan Wong, Ellen Kuhl, Eric Darve

Sparsity-Specific Code Optimization using Expression Trees

We introduce a code generator that converts unoptimized C++ code operating on sparse data into vectorized and parallel CPU or GPU kernels. Our approach unrolls the computation into a massive expression graph, performs redundant expression…

Programming Languages · Computer Science 2022-03-15 Philipp Herholz, Xuan Tang, Teseo Schneider, Shoaib Kamil, Daniele Panozzo, Olga Sorkine-Hornung

Kernel machines that adapt to GPUs for effective large batch training

Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing…

Machine Learning · Statistics 2019-03-05 Siyuan Ma, Mikhail Belkin