Related papers: Understanding GPU Resource Interference One Level …
In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…
Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…
The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…
GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the…
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g.\ OpenCL) do not mandate fair scheduling, and…
Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…
In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly…
Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…
There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of…
Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving…
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving…
GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes…
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU…
This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics…
Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead…
Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Despite the potential concurrency between independent kernels, the order in which kernels are issued to the GPU will…
GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet…
Modern GPU workloads increasingly demand efficient resource sharing, as many jobs do not require the full capacity of a GPU. Among sharing techniques, NVIDIA's Multi-Instance GPU (MIG) offers strong resource isolation by enabling…
Advances in GPU compute throughput and memory capacity brings significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application…