Related papers: Exposing Shadow Branches
Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set.…
High-performance branch target buffers (BTBs) and the L1I cache are key to high-performance front-end. Modern branch predictors are highly accurate, but with an increase in code footprint in modern-day server workloads, BTB and L1I misses…
Many contemporary applications feature multi-megabyte instruction footprints that overwhelm the capacity of branch target buffers (BTB) and instruction caches (L1-I), causing frequent front-end stalls that inevitably hurt performance. BTB…
Modern processors have suffered a deluge of threats exploiting branch instruction collisions inside the branch prediction unit (BPU), from eavesdropping on secret-related branch operations to triggering malicious speculative executions.…
Load-Dependent Branches (LDB) often do not exhibit regular patterns in their local or global history and thus are inherently hard to predict correctly by conventional branch predictors. We propose a software-to-hardware branch…
Modern out-of-order CPUs heavily rely on speculative execution for performance optimization, with branch prediction serving as a cornerstone to minimize stalls and maximize efficiency. Whenever shared branch prediction resources lack proper…
Efficiency in instruction fetching is critical to performance, and this requires the primary structures--L1 instruction caches (L1i), branch target buffers (BTB) and instruction TLBs (iTLB)--to have the requisite information when needed.…
Modern processors rely heavily on speculation to keep the pipeline filled and consequently execute and commit instructions as close to maximum capacity as possible. To improve instruction-level parallelism, the processor core needs to fetch…
Modern storage systems intensively utilize data prefetching algorithms while processing sequences of the read requests. Performance of the prefetching algorithm (for instance increase of the cache hit ratio of the cache system - CHR)…
Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However,…
Transient execution attacks that exploit speculation have raised significant concerns in computer systems. Typically, branch predictors are leveraged to trigger mis-speculation in transient execution attacks. In this work, we demonstrate a…
L1 instruction (L1-I) cache misses are a source of performance bottleneck. Sequential prefetchers are simple solutions to mitigate this problem; however, prior work has shown that these prefetchers leave considerable potentials uncovered.…
Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process…
Branch predictor (BP) is an essential component in modern processors since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on wrong-path. However, reducing latency and storage…
Branch prediction is key to the performance of out-of-order processors. While the CBP-2016 winner TAGE-SC-L combines geometric-history tables, a statistical corrector, and a loop predictor, over half of its remaining mispredictions stem…
The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ…
Cache prefetcher greatly eliminates compulsory cache misses, by fetching data from slower memory to faster cache before it is actually required by processors. Sophisticated prefetchers predict next use cache line by repeating program's…
FDTD codes, such as Sophie developed at CEA/DAM, no longer take advantage of the processor's increased computing power, especially recently with the raising multicore technology. This is rooted in the fact that low order numerical schemes…
Large-scale networked services rely on deep soft-ware stacks and microservice orchestration, which increase instruction footprints and create frontend stalls that inflate tail latency and energy. We revisit instruction prefetching for these…
Modern processor designs use a variety of microarchitectural methods to achieve high performance. Unfortunately, new side-channels have often been uncovered that exploit these enhanced designs. One area that has received little attention…