Related papers: Exploring DRAM Cache Prefetching for Pooled Memory
Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving. Diverse workloads lead to a need for configurable memory resources to…
Advancement in Processor technology has made it easy to handle data-intensive workloads, but limiting main memory advances has created performance bottlenecks. In DRAM, there have been improvements in DRAM access latency as well as…
Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent…
Caching is crucial for enabling high-throughput networks for data intensive applications. Traditional caching technology relies on DRAM, as it can transfer data at a high rate. However, DRAM capacity is subject to contention by most system…
Memory latencies and bandwidth are major factors, limiting system performance and scalability. Modern CPUs aim at hiding latencies by employing large caches, out-of-order execution, or complex hardware prefetchers. However, software-based…
Integrating compute express link (CXL) with SSDs allows scalable access to large memory but has slower speeds than DRAMs. We present ExPAND, an expander-driven CXL prefetcher that offloads last-level cache (LLC) prefetching from host CPU to…
The growing prevalence of data-intensive workloads, such as artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), in-memory databases, and real-time analytics, has exposed limitations in conventional memory…
Public cloud providers seek to meet stringent performance requirements and low hardware cost. A key driver of performance and cost is main memory. Memory pooling promises to improve DRAM utilization and thereby reduce costs. However,…
Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising…
Data prefetching, i.e., the act of predicting application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely-used approach to hide the long latency of memory accesses. The fruitfulness…
Cache prefetcher greatly eliminates compulsory cache misses, by fetching data from slower memory to faster cache before it is actually required by processors. Sophisticated prefetchers predict next use cache line by repeating program's…
Modern high-performance architectures employ large last-level caches (LLCs). While large LLCs can reduce average memory access latency for workloads with a high degree of locality, they can also increase latency for workloads with irregular…
CXL has been the emerging technology for expanding memory for both the host CPU and device accelerators with load/store interface. Extending memory coherency to the PCIe root complex makes the codesign more flexible in that you can access…
High load latency that results from deep cache hierarchies and relatively slow main memory is an important limiter of single-thread performance. Data prefetch helps reduce this latency by fetching data up the hierarchy before it is…
Reducing the average memory access time is crucial for improving the performance of applications running on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention.…
CXLMemSim is a fast, lightweight simulation framework that enables performance characterization of memory systems based on Compute Express Link (CXL) .mem technology. CXL.mem allows disaggregation and pooling of memory to mitigate memory…
Modern distributed file systems rely on uncoordinated, per node page caches that replicate hot data locally across the cluster. While ensuring fast local access, this architecture underutilizes aggregate cluster DRAM capacity through…
Memory resources in data centers generally suffer from low utilization and lack of dynamics. Memory disaggregation solves these problems by decoupling CPU and memory, which currently includes approaches based on RDMA or interconnection…
In the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a…
Compute Express Link (CXL) emerges as a solution for wide gap between computational speed and data communication rates among host and multiple devices. It fosters a unified and coherent memory space between host and CXL storage devices such…