Related papers: GPU Load Balancing

A Programming Model for GPU Load Balancing

We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-01-13 · Muhammad Osama, Serban D. Porumbescu, John D. Owens

Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an…

Data Structures and Algorithms · Computer Science · 2023-01-11 · Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens

An Adaptive Load Balancer For Graph Analytical Applications on GPUs

Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-28 · Vishwesh Jatala, Loc Hoang, Roshan Dathathri, Gurbinder Gill, V Krishna Nandivada, Keshav Pingali

Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters

General matrix multiplication (GEMM) operations are the fundamental building blocks of computational domains including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-26 · Harisankar Sadasivan, Muhammed Emin Ozturk, Muhammad Osama, Chris Millette, Astha Rai, Maksim Podkorytov, John Afaganis, Carlus Huang, Jing Zhang, Jun Liu

Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-11-02 · Ananya Raval, Rupesh Nasre, Vivek Kumar, Vasudevan R, Sathish Vadhiyar, Keshav Pingali

In-Situ Assessment of Device-Side Compute Work for Dynamic Load Balancing in a GPU-Accelerated PIC Code

Maintaining computational load balance is important to the performant behavior of codes which operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-05 · Michael E. Rowan, Axel Huebl, Kevin N. Gott, Jack Deslippe, Maxence Thévenet, Remi Lehe, Jean-Luc Vay

Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads

In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-05-24 · Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou

Balanced 3DGS: Gaussian-wise Parallelism Rendering with Fine-Grained Tiling

3D Gaussian Splatting (3DGS) is increasingly attracting attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains a time-intensive task, especially in load…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-09 · Hao Gui, Lin Hu, Rui Chen, Mingxiao Huang, Yuxin Yin, Jin Yang, Yong Wu, Chen Liu, Zhongxu Sun, Xueyang Zhang, Kun Zhan

Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU

In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-18 · Mark Blanco, Tze Meng Low, Kyungjoo Kim

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…

Performance · Computer Science · 2020-06-22 · James D. Stevens, Andreas Klöckner

A GPU-Accelerated Distributed Algorithm for Optimal Power Flow in Distribution Systems

We propose a GPU-accelerated distributed optimization algorithm for controlling multi-phase optimal power flow in active distribution systems with dynamically changing topologies. To handle varying network configurations and enable…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-15 · Minseok Ryu, Geunyeong Byeon, Kibaek Kim

Skew Handling in Aggregate Streaming Queries on GPUs

Nowadays, the data to be processed by database systems has grown so large that any conventional, centralized technique is inadequate. At the same time, general purpose computation on GPU (GPGPU) recently has successfully drawn attention…

Databases · Computer Science · 2013-09-04 · Georgios Koutsoumpakis, Iakovos Koutsoumpakis, Anastasios Gounaris

Fine-grained MoE Load Balancing with Linear Programming

Mixture-of-Experts (MoE) has emerged as a promising approach to scale up deep learning models due to its significant reduction in computational resources. However, the dynamic nature of MoE leads to load imbalance among experts, severely…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-16 · Chenqi Zhao, Wenfei Wu, Linhai Song, Yuchen Xu, Yitao Yuan

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-03 · Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

A GPU-based Distributed Algorithm for Linearized Optimal Power Flow in Distribution Systems

We propose a GPU-based distributed optimization algorithm, aimed at controlling optimal power flow in multi-phase and unbalanced distribution systems. Typically, conventional distributed optimization algorithms employed in such scenarios…

Optimization and Control · Mathematics · 2023-10-17 · Minseok Ryu, Geunyeong Byeon, Kibaek Kim

Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-28 · Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-11-18 · Hancheng Wu, Da Li, Michela Becchi

Understanding GPU Resource Interference One Level Deeper

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-02-17 · Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-27 · Xiaoteng Liu, Pavly Halim

Themis: Fair and Efficient GPU Cluster Scheduling

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-30 · Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, Shuchi Chawla