Related papers: GPU Load Balancing
We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior…
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an…
Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel…
General matrix multiplication (GEMM) operations are the fundamental building blocks of computational domains including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important,…
Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent \textit{irregularity} in graph applications leads to several challenges for parallelization.…
Maintaining computational load balance is important to the performant behavior of codes which operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if…
In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…
3D Gaussian Splatting (3DGS) is increasingly attracting attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains a time-intensive task, especially in load…
In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by…
The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…
We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable…
Nowadays, the data to be processed by database systems has grown so large that any conventional, centralized technique is inadequate. At the same time, general purpose computation on GPU (GPGPU) recently has successfully drawn attention…
Mixture-of-Experts (MoE) has emerged as a promising approach to scale up deep learning models due to its significant reduction in computational resources. However, the dynamic nature of MoE leads to load imbalance among experts, severely…
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…
We propose a GPU-based distributed optimization algorithm, aimed at controlling optimal power flow in multi-phase and unbalanced distribution systems. Typically, conventional distributed optimization algorithms employed in such scenarios…
It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on…
GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…
GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…
Analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix…
Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to…