Reducing Load Latency with Cache Level Prediction

Majid Jalili; Mattan Erez

Hardware Architecture 2021-03-30 v1

Abstract

High load latency that results from deep cache hierarchies and relatively slow main memory is an important limiter of single-thread performance. Data prefetching helps reduce this latency by fetching data up the hierarchy before load instructions request it. However, data prefetching has been shown to be imperfect in many situations. We propose cache-level prediction to complement prefetchers. Our method predicts which level of the memory hierarchy a load will access, allowing the memory access to start earlier and thereby saving many cycles. The predictor provides high prediction accuracy at the cost of just one cycle of added latency on L1 misses. Experimental results show a speedup of 7.8% on generic, graph, and HPC applications over a baseline with aggressive prefetchers.
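The abstract does not describe how the predictor is organized. As a rough illustration only, the Python sketch below assumes a PC-indexed table with a saturating confidence counter per entry (a common structure for hardware predictors, not necessarily the authors' design) and a toy latency model in which a correct prediction skips the intermediate cache lookups. The names `Level`, `LevelPredictor`, `load_latency`, the table size, the confidence threshold, and every latency value are hypothetical; only the one-cycle penalty on L1 misses comes from the abstract.

```python
from enum import IntEnum


class Level(IntEnum):
    """Memory hierarchy levels, ordered from closest to farthest."""
    L1 = 1
    L2 = 2
    L3 = 3
    DRAM = 4


class LevelPredictor:
    """Hypothetical PC-indexed cache-level predictor (illustrative only)."""

    def __init__(self, entries=1024, max_conf=3, threshold=2):
        self.entries = entries
        self.max_conf = max_conf
        self.threshold = threshold
        # Each entry holds [predicted level, saturating confidence counter].
        self.table = [[Level.L2, 0] for _ in range(entries)]

    def _index(self, pc):
        # Direct-mapped by PC; different loads may alias to the same entry.
        return pc % self.entries

    def predict(self, pc):
        """Return a predicted level, or None if confidence is too low to act on."""
        level, conf = self.table[self._index(pc)]
        return level if conf >= self.threshold else None

    def update(self, pc, actual):
        """Train the entry with the level the load actually hit in."""
        entry = self.table[self._index(pc)]
        if entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)
        elif entry[1] > 0:
            entry[1] -= 1  # lose confidence before switching levels
        else:
            entry[0] = actual  # confidence exhausted: adopt the new level


# Hypothetical per-level access latencies in cycles (NOT taken from the paper).
LATENCY = {Level.L1: 4, Level.L2: 12, Level.L3: 40, Level.DRAM: 200}
PREDICTOR_PENALTY = 1  # one added cycle on every L1 miss, per the abstract


def load_latency(actual, predicted):
    """Toy model: a correct prediction sends the miss straight to its level."""
    if actual == Level.L1:
        return LATENCY[Level.L1]
    if predicted == actual:
        return LATENCY[Level.L1] + PREDICTOR_PENALTY + LATENCY[actual]
    # No prediction or a wrong one: fall back to a level-by-level walk
    # (a simplification; real mispredictions have other costs).
    serial = sum(LATENCY[lvl] for lvl in Level if lvl <= actual)
    return serial + PREDICTOR_PENALTY


if __name__ == "__main__":
    pred = LevelPredictor()
    pc = 0x400A10
    for _ in range(3):
        pred.update(pc, Level.DRAM)
    guess = pred.predict(pc)
    print(guess.name)                      # DRAM
    print(load_latency(Level.DRAM, guess))  # 205
    print(load_latency(Level.DRAM, None))   # 257
```

Under these made-up numbers, a load that misses all the way to DRAM takes 205 cycles with a correct prediction versus 257 cycles when it walks L1, L2, and L3 first. The confidence threshold is one way to keep a single outlier access from redirecting a stable load; the paper's actual accuracy and latency figures should be taken from the paper itself.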

Keywords

cache-level prediction, load latency, data prefetching, memory hierarchy

Cite

@article{arxiv.2103.14808,
  title  = {Reducing Load Latency with Cache Level Prediction},
  author = {Majid Jalili and Mattan Erez},
  journal= {arXiv preprint arXiv:2103.14808},
  year   = {2021}
}