Understanding GPU Resource Interference One Level Deeper

Paul Elvinger; Foteini Strati; Natalie Enright Jerger; Ana Klimovic

Distributed, Parallel, and Cluster Computing 2026-02-17 v3

Abstract

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating applications is known to improve GPU utilization, but is not common practice as it becomes difficult to provide predictable performance due to workload interference. Providing predictable performance guarantees requires a deep understanding of how applications contend for shared GPU resources such as block schedulers, compute units, L1/L2 caches, and memory bandwidth. We study the key types of GPU resource interference and develop a methodology to quantify the sensitivity of a workload to each type. We discuss how this methodology can serve as the foundation for GPU schedulers that enforce strict performance guarantees and how application developers can design GPU kernels with colocation in mind to improve efficiency.

Keywords

GPU computing large language model inference scheduling

Cite

@article{arxiv.2501.16909,
  title   = {Understanding GPU Resource Interference One Level Deeper},
  author  = {Paul Elvinger and Foteini Strati and Natalie Enright Jerger and Ana Klimovic},
  journal = {arXiv preprint arXiv:2501.16909},
  year    = {2026}
}
