Related papers: Deep Learning based Data Prefetching in CPU-GPU Un…

An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory

This paper proposes a novel intelligent framework for oversubscription management in CPU-GPU UVM. We analyze the current rule-based methods of GPU memory oversubscription with unified memory, and the current learning-based methods for other…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-15 Xinjian Long, Xiangyang Gong, Huiyang Zhou

UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, which shifts the complex memory management from programmers to GPU driver/…

Hardware Architecture · Computer Science 2020-10-22 Yongbin Gu, Wenxuan Wu, Yunfan Li, Lizhong Chen

GPUVM: GPU-driven Unified Virtual Memory

Graphics Processing Units (GPUs) leverage massive parallelism and large memory bandwidth to support high-performance computing applications, such as multimedia rendering, crypto-mining, deep learning, and natural language processing. These…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Nurlan Nazaraliyev, Elaheh Sadredini, Nael Abu-Ghazaleh

Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications

Discrete GPU accelerators, while providing massive computing power for supercomputers and data centers, have their separate memory domain. Explicit memory management across device and host domains in programming is tedious and error-prone.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-14 Bennett Cooper, Thomas R. W. Scogland, Rong Ge

TransforMAP: Transformer for Memory Access Prediction

Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. Prefetching relies on accurate memory access prediction, to which task machine learning based methods are increasingly applied.…

Hardware Architecture · Computer Science 2022-05-31 Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori, Rajgopal Kannan, Viktor K. Prasanna

Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching

Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a…

Hardware Architecture · Computer Science 2022-05-06 Pengmiao Zhang, Ajitesh Srivastava, Anant V. Nori, Rajgopal Kannan, Viktor K. Prasanna

Managing Hybrid Main Memories with a Page-Utility Driven Performance Model

Hybrid memory systems comprised of dynamic random access memory (DRAM) and non-volatile memory (NVM) have been proposed to exploit both the capacity advantage of NVM and the latency and dynamic energy advantages of DRAM. An important…

Hardware Architecture · Computer Science 2019-12-18 Yang Li, Jongmoo Choi, Jin Sun, Saugata Ghose, Hui Wang, Justin Meza, Jinglei Ren, Onur Mutlu

Data Cache Prefetching with Perceptron Learning

Cache prefetcher greatly eliminates compulsory cache misses, by fetching data from slower memory to faster cache before it is actually required by processors. Sophisticated prefetchers predict next use cache line by repeating program's…

Hardware Architecture · Computer Science 2017-12-05 Haoyuan Wang, Zhiwei Luo

Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Discrete GPUs are a cornerstone of HPC and data center systems, requiring management of separate CPU and GPU memory spaces. Unified Virtual Memory (UVM) has been proposed to ease the burden of memory management; however, at a high cost in…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-14 Jacob Wahlgren, Gabin Schieffer, Ruimin Shi, Edgar A. León, Roger Pearce, Maya Gokhale, Ivy Peng

Learning Memory Access Patterns

The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations,…

Machine Learning · Computer Science 2018-03-16 Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, Parthasarathy Ranganathan

UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-18 Hai Huang

xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and…

Performance · Computer Science 2025-10-27 Jiabo Shi, Dimitrios Pezaros, Yehia Elkhatib

Prefetching in Deep Memory Hierarchies with NVRAM as Main Memory

Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-27 Manel Lurbe, Miguel Avargues, Salvador Petit, Maria E. Gomez, Rui Yang, Guanhao Wang, Julio Sahuquillo

HMM-V: Heterogeneous Memory Management for Virtualization

The memory demand of virtual machines (VMs) is increasing, while DRAM has limited capacity and high power consumption. Non-volatile memory (NVM) is an alternative to DRAM, but it has high latency and low bandwidth. We observe that the VM…

Operating Systems · Computer Science 2022-09-28 Sai Sha, Chuandong Li, Yingwei Luo, Xiaolin Wang, Zhenlin Wang

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Stochastic Modeling of Hybrid Cache Systems

In recent years, there is an increasing demand of big memory systems so to perform large scale data analytics. Since DRAM memories are expensive, some researchers are suggesting to use other memory systems such as non-volatile memory (NVM)…

Performance · Computer Science 2016-10-03 Gaoying Ju, Yongkun Li, Yinlong Xu, Jiqiang Chen, John C. S. Lui

Accelerating Graph Analytics on a Reconfigurable Architecture with a Data-Indirect Prefetcher

The irregular nature of memory accesses of graph workloads makes their performance poor on modern computing platforms. On manycore reconfigurable architectures (MRAs), in particular, even state-of-the-art graph prefetchers do not work well…

Hardware Architecture · Computer Science 2023-01-31 Yichen Yang, Jingtao Li, Nishil Talati, Subhankar Pal, Siying Feng, Chaitali Chakrabarti, Trevor Mudge, Ronald Dreslinski

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-02 Rohan Garg, Apoorve Mohan, Michael Sullivan, Gene Cooperman

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing the inference latency and cost…

Artificial Intelligence · Computer Science 2025-05-27 Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli