GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating applications is known to improve GPU utilization, but is not common practice as it becomes difficult to provide predictable performance due to workload interference. Providing predictable performance guarantees requires a deep understanding of how applications contend for shared GPU resources such as block schedulers, compute units, L1/L2 caches, and memory bandwidth. We study the key types of GPU resource interference and develop a methodology to quantify the sensitivity of a workload to each type. We discuss how this methodology can serve as the foundation for GPU schedulers that enforce strict performance guarantees and how application developers can design GPU kernels with colocation in mind to improve efficiency.
@article{arxiv.2501.16909,
title = {Understanding GPU Resource Interference One Level Deeper},
author = {Paul Elvinger and Foteini Strati and Natalie Enright Jerger and Ana Klimovic},
journal= {arXiv preprint arXiv:2501.16909},
year = {2026}
}