Related papers: Compact NUMA-Aware Locks
Developing concurrent software is challenging, especially if it has to run on modern architectures with Weak Memory Models (WMMs) such as ARMv8, Power, or RISC-V. For the sake of performance, WMMs allow hardware and compilers to…
Modern multi-socket architectures offer a single virtual address space, but physically divide main-memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies the usage, developers must be…
The distributed shared memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and inevitably exhibits non-uniform memory access (NUMA) to shared-memory parallel applications.…
Traditionally, multithreaded data structures have been designed for access by the threads of Operating Systems (OS). However, implementations for access by programmable alternatives known as lightweight threads (also referred to as…
We present "Reciprocating Locks", a novel mutual exclusion locking algorithm, targeting cache-coherent shared memory (CC), that enjoys a number of desirable properties. The doorway arrival phase and the release operation both run in…
In this article we present Mutable Locks, a synchronization construct with the same execution semantic of traditional locks (such as spin locks or sleep locks), but with a self-tuned optimized trade off between responsiveness---in the…
Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffer from synchronous barriers for each iteration and skewed…
Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing…
This paper presents a multithread and efficient cryptographic hardware access (MECHA) for efficient and fast cryptographic operations that eliminates the need for context switching. Utilizing a UNIX domain socket, MECHA manages multiple…
Parallel processing is considered as todays and future trend for improving performance of computers. Computing devices ranging from small embedded systems to big clusters of computers rely on parallelizing applications to reduce execution…
Classic test-and-test (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any…
Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds…
Remote direct memory access (RDMA) networks are being rapidly adopted into industry for their high speed, low latency, and reduced CPU overheads compared to traditional kernel-based TCP/IP networks. RDMA enables threads to access remote…
This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this…
We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under…
Concurrent priority queues are widely used in important workloads, such as graph applications and discrete event simulations. However, designing scalable concurrent priority queues for NUMA architectures is challenging. Even though several…
Disaggregated memory (DM) separates compute and memory resources, allowing flexible scaling to achieve high resource utilization. To ensure atomic and consistent data access on DM, distributed transaction systems have been adapted, where…
Saturated locks often degrade the performance of a multithreaded application, leading to a so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall…
We use the XSBench proxy application, a memory-intensive OpenMP program, to explore the source of on-node scalability degradation of a popular Monte Carlo (MC) reactor physics benchmark on non-uniform memory access (NUMA) systems. As…
Multicore systems present on-board memory hierarchies and communication networks that influence performance when executing shared memory parallel codes. Characterising this influence is complex, and understanding the effect of particular…