Related papers: Optimizing GPU Cache Policies for MI Workloads

Understanding GPU Resource Interference One Level Deeper

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

To Use or Not to Use: CPUs' Cache Optimization Techniques on GPGPUs

General Purpose Graphic Processing Unit (GPGPU) is used widely for achieving high performance or high throughput in parallel programming. This capability of GPGPUs is very famous in the new era and mostly used for scientific computing which…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-10 Vajira Thambawita, Roshan G. Ragel, Dhammike Elkaduwe

Cache Bypassing for Machine Learning Algorithms

Graphics Processing Units (GPUs) were once used solely for graphical computation tasks but with the increase in the use of machine learning applications, the use of GPUs to perform general-purpose computing has increased in the last few…

Hardware Architecture · Computer Science 2021-02-16 Asim Ikram, Muhammad Awais Ali, Mirza Omer Beg

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-11 Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami

Intra-node Memory Safe GPU Co-Scheduling

GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-12-14 Carlos Reano, Federico Silla, Dimitrios S. Nikolopoulos, Blesson Varghese

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-10 Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng

GPUs as Storage System Accelerators

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Samer Al-Kiswany, Abdullah Gharaibeh, Matei Ripeanu

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through…

Machine Learning · Computer Science 2026-04-27 Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Shenyao Chen, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

Cache Optimization for Memory Intensive Workloads on Multi-socket Multi-core servers

Major chip manufacturers have all introduced multicore microprocessors. Multi-socket systems built from these processors are used for running various server applications. Depending on the application that is run on the system, remote memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-13 Murthy Durbhakula

Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core - MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-15 George Teodoro, Tahsin Kurc, Guilherme Andrade, Jun Kong, Renato Ferreira, Joel Saltz

Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization

Modern machine learning training is increasingly bottlenecked by data I/O rather than compute. GPUs often sit idle at below 50% utilization waiting for data. This paper presents a machine learning approach to predict I/O performance and…

Performance · Computer Science 2025-12-22 Karthik Prabhakar, Durgamadhab Mishra

Revisiting Query Performance in GPU Database Systems

GPUs offer massive compute parallelism and high-bandwidth memory accesses. GPU database systems seek to exploit those capabilities to accelerate data analytics. Although modern GPUs have more resources (e.g., higher DRAM bandwidth) than…

Databases · Computer Science 2023-02-03 Jiashen Cao, Rathijit Sen, Matteo Interlandi, Joy Arulraj, Hyesoon Kim

Optimizing SSD Caches for Cloud Block Storage Systems Using Machine Learning Approaches

The growing demand for efficient cloud storage solutions has led to the widespread adoption of Solid-State Drives (SSDs) for caching in cloud block storage systems. The management of data writes to SSD caches plays a crucial role in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-30 Chiyu Cheng, Chang Zhou, Yang Zhao, Jin Cao

Recent Advances in Overcoming Bottlenecks in Memory Systems and Managing Memory Resources in GPU Systems

This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics…

Hardware Architecture · Computer Science 2018-05-30 Onur Mutlu, Saugata Ghose, Rachata Ausavarungnirun

An Analysis of Collocation on GPUs for Deep Learning Training

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit…

Machine Learning · Computer Science 2023-04-25 Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün

Characterizing the impact of last-level cache replacement policies on big-data workloads

In recent years, graph-processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph-processing typically uses large input sets, often in multi-gigabyte scale, and data-dependent graph…

Hardware Architecture · Computer Science 2025-10-24 Alexandre Valentin Jamet, Lluc Alvarez, Marc Casas

A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures

In order to improve system performance efficiently, a number of systems choose to equip multi-core and many-core processors (such as GPUs). Due to their discrete memory these heterogeneous architectures comprise a distributed system within…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-02-27 Hao Wu, Daniel Lohmann, Wolfgang Schröder-Preikschat