Related papers: On Memory Codelets: Prefetching, Recoding, Moving …
The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations,…
Recently, hardware technology has rapidly evolved pertaining to domain-specific applications/architectures. Soon, processors may be composed of a large collection of vendor-independent IP specialized for application-specific algorithms,…
Caches only exploit spatial and temporal locality in a set of address referenced in a program. Due to dynamic construction of linked data-structures, they are difficult to cache as the spatial locality between the nodes is highly dependent…
The memory subsystem has always been a bottleneck in performance as well as significant power contributor in memory intensive applications. Many researchers have presented multi-layered memory hierarchies as a means to design energy and…
Modern high-performance architectures employ large last-level caches (LLCs). While large LLCs can reduce average memory access latency for workloads with a high degree of locality, they can also increase latency for workloads with irregular…
Accurate memory prefetching is paramount for processor performance, and modern processors employ various techniques to identify and prefetch different memory access patterns. While most modern prefetchers target spatio-temporal patterns by…
Hardware based memory pooling enabled by interconnect standards like CXL have been gaining popularity amongst cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at a penalty of increased…
Data prefetching, i.e., the act of predicting application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely-used approach to hide the long latency of memory accesses. The fruitfulness…
Memory latencies and bandwidth are major factors, limiting system performance and scalability. Modern CPUs aim at hiding latencies by employing large caches, out-of-order execution, or complex hardware prefetchers. However, software-based…
We consider a cache network in which a single server is connected to multiple users via a shared error free link. The server has access to a database with $N$ files of equal length $F$, and serves $K$ users each with a cache memory of $MF$…
Cache prefetcher greatly eliminates compulsory cache misses, by fetching data from slower memory to faster cache before it is actually required by processors. Sophisticated prefetchers predict next use cache line by repeating program's…
Modern prefetchers identify memory access patterns in order to predict future accesses. However, many applications exhibit irregular access patterns that do not manifest spatio-temporal locality in the memory address space. Such…
Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a…
Network switches and routers need to serve packet writes and reads at rates that challenge the most advanced memory technologies. As a result, scaling the switching rates is commonly done by parallelizing the packet I/Os using multiple…
Emerging applications, such as big data analytics and machine learning, require increasingly large amounts of main memory, often exceeding the capacity of current commodity processors built on DRAM technology. To address this, recent…
Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing the inference latency and cost…
Unified Virtual Memory (UVM) relieves the developers from the onus of maintaining complex data structures and explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging soon…
Computing has a huge memory problem. The memory system, consisting of multiple technologies at different levels, is responsible for most of the energy consumption, performance bottlenecks, robustness problems, monetary cost, and hardware…
Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale…
Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. Prefetching relies on accurate memory access prediction, to which task machine learning based methods are increasingly applied.…