Related papers: Phase-Priority based Directory Coherence for Multi…
Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a…
The use of multi-chip modules (MCM) and/or multi-socket boards is the most suitable approach to increase the computation density of servers while keep chip yield attained. This paper introduces a new coherence protocol suitable, in terms of…
The \emph{Partial Cache-Coherence (PCC)} model maintains hardware cache coherence only within subsets of cores, enabling large-scale memory sharing with emerging memory interconnect technologies like Compute Express Link (CXL). However,…
Reducing the average memory access time is crucial for improving the performance of applications running on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention.…
In this paper, we propose a concurrency control protocol, called the Prudent-Precedence Concurrency Control (PPCC) protocol, for high data contention database environments. PPCC is prudently more aggressive in permitting more serializable…
Real-time and cyber-physical systems need to interact with and respond to their physical environment in a predictable time. While multicore platforms provide incredible computational power and throughput, they also introduce new sources of…
In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay…
Most commercial embedded devices have been deployed with a single processor architecture. The code size and complexity of applications running on embedded devices are rapidly increasing due to the emergence of application business models…
Cache partitioning techniques have been successfully adopted to mitigate interference among concurrently executing real-time tasks on multi-core processors. Considering that the execution time of a cache-sensitive task strongly depends on…
Processing-in-memory (PIM) architectures have seen an increase in popularity recently, as the high internal bandwidth available within 3D-stacked memory provides greater incentive to move some computation into the logic layer of the memory.…
General trends in computer architecture are shifting more towards parallelism. Multicore architectures have proven to be a major step in processor evolution. With the advancement in multicore architecture, researchers are focusing on…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integrations between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache…
Advancements in multi-core have created interest among many research groups in finding out ways to harness the true power of processor cores. Recent research suggests that on-board component such as cache memory plays a crucial role in…
Modern commercial-off-the-shelf (COTS) multicore processors have advanced memory hierarchies that enhance memory-level parallelism (MLP), which is crucial for high performance. To support high MLP, shared last-level caches (LLCs) are…
Modern computer designs support composite prefetching, where multiple individual prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can drastically hurt…
Conventional cache models are not suited for real-time parallel processing because tasks may flush each other's data out of the cache in an unpredictable manner. In this way the system is not compositional so the overall performance is…
Directory-based protocols have been the de facto solution for maintaining cache coherence in shared-memory parallel systems comprising multi/many cores, where each store instruction is eagerly made globally visible by invalidating the…
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts…
We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded…