Related papers: LRC: Dependency-Aware Cache Management for Data An…
Memory caches are being aggressively used in today's data-parallel frameworks such as Spark, Tez and Storm. By caching input and intermediate data in memory, compute tasks can witness speedup by orders of magnitude. To maximize the chance…
This paper presents a comprehensive comparison of distributed caching algorithms employed in modern distributed systems. We evaluate various caching strategies including Least Recently Used (LRU), Least Frequently Used (LFU), Adaptive…
In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…
Efficient edge caching reduces latency and alleviates backhaul congestion in modern networks. Traditional caching policies, such as Least Recently Used (LRU) and Least Frequently Used (LFU), perform well under specific request patterns. LRU…
The scaling of Large Language Model (LLM) services faces significant cost and latency challenges, making effective caching under tight capacity crucial. Existing cache replacement policies, from heuristics to learning-based methods,…
Caching systems using the Least Recently Used (LRU) principle have now become ubiquitous. A fundamental question for these systems is whether the cache space should be pooled together or divided to serve multiple flows of data item requests…
Modern processors use cache memory: a memory access that "hits" the cache returns early, while a "miss" takes more time. Given a memory access in a program, cache analysis consists in deciding whether this access is always a hit, always a…
Last-Level Cache (LLC) represents the bulk of a modern CPU processor's transistor budget and is essential for application performance as LLC enables fast access to data in contrast to much slower main memory. However, applications with…
Caching plays a crucial role in networking systems to reduce the load on the network and is commonly employed by content delivery networks (CDNs) in order to improve performance. One of the commonly used mechanisms, Least Recently Used…
Prompt caching is critical for reducing latency and cost in LLM inference: OpenAI and Anthropic report up to 50-90% cost savings through prompt reuse. Despite its widespread success, little is known about what constitutes an optimal prompt…
While the cost of computation is an easy-to-understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the…
This article introduces a novel family of decentralised caching policies, applicable to wireless networks with finite storage at the edge-nodes (stations). These policies, which are based on the Least-Recently-Used replacement principle, are…
Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance,…
Adaptive Replacement Cache (ARC) and CLOCK with Adaptive Replacement (CAR) are state-of-the-art "adaptive" cache replacement algorithms invented to improve on the shortcomings of classical cache replacement policies such as LRU, LFU and…
For applications in worst-case execution time analysis and in security, it is desirable to statically classify memory accesses into those that result in cache hits, and those that result in cache misses. Among cache replacement policies,…
Cache persistence analysis is an important part of worst-case execution time (WCET) analysis. It has been extensively studied in the past twenty years. Despite these efforts, all existing persistence analyses are approximative in the sense…
Current-day processors employ a multi-level cache hierarchy with one or two levels of private caches and a shared last-level cache (LLC). An efficient cache replacement policy at LLC is essential for reducing the off-chip memory transfer as…
It is generally observed that the fraction of live lines in shared last-level caches (SLLC) is very small for chip multiprocessors (CMPs). This can be tackled using promotion-based replacement policies like re-reference interval prediction…
In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through…
DRAM-based memory is a critical bottleneck for system performance, since processor speeds far outpace DRAM access latency. In this thesis, we develop a low-cost mechanism, called ChargeCache, which enables…