Related papers: Compact NUMA-Aware Locks

Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models

Developing concurrent software is challenging, especially if it has to run on modern architectures with Weak Memory Models (WMMs) such as ARMv8, Power, or RISC-V. For the sake of performance, WMMs allow hardware and compilers to…

Operating Systems · Computer Science 2022-07-12 Antonio Paolillo, Hernán Ponce-de-León, Thomas Haas, Diogo Behrens, Rafael Chehab, Ming Fu, Roland Meyer

Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space

Modern multi-socket architectures offer a single virtual address space, but physically divide main-memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies the usage, developers must be…

Databases · Computer Science 2026-02-06 Felix Schuhknecht, Nick Rassau

JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications

The distributed shared memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and inevitably exhibits non-uniform memory access (NUMA) to shared-memory parallel applications.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-21 Zhang Yang, Aiqing Zhang, Zeyao Mo

Basic Lock Algorithms in Lightweight Thread Environments

Traditionally, multithreaded data structures have been designed for access by the threads of Operating Systems (OS). However, implementations for access by programmable alternatives known as lightweight threads (also referred to as…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-10 Taras Skazhenik, Nikolai Korobenikov, Andrei Churbanov, Anton Malakhov, Vitaly Aksenov

Reciprocating Locks

We present "Reciprocating Locks", a novel mutual exclusion locking algorithm, targeting cache-coherent shared memory (CC), that enjoys a number of desirable properties. The doorway arrival phase and the release operation both run in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-14 Dave Dice, Alex Kogan

Mutable Locks: Combining the Best of Spin and Sleep Locks

In this article we present Mutable Locks, a synchronization construct with the same execution semantics as traditional locks (such as spin locks or sleep locks), but with a self-tuned optimized trade-off between responsiveness---in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-02 Romolo Marotta, Davide Tiriticco, Pierangelo Di Sanzo, Alessandro Pellegrini, Bruno Ciciani, Francesco Quaglia

clusterNOR: A NUMA-Optimized Clustering Framework

Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffers from synchronous barriers for each iteration and skewed…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-19 Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

Scalable Range Locks for Scalable Address Spaces and Beyond

Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing…

Operating Systems · Computer Science 2020-06-23 Alex Kogan, Dave Dice, Shady Issa

MECHA: Multithreaded and Efficient Cryptographic Hardware Access

This paper presents a multithread and efficient cryptographic hardware access (MECHA) for efficient and fast cryptographic operations that eliminates the need for context switching. Utilizing a UNIX domain socket, MECHA manages multiple…

Cryptography and Security · Computer Science 2025-06-19 Pratama Derry, Laksmono Agus Mahardika Ari, Iqbal Muhammad, Howon Kim

Towards Efficient OpenMP Strategies for Non-Uniform Architectures

Parallel processing is considered the current and future trend for improving the performance of computers. Computing devices ranging from small embedded systems to big clusters of computers rely on parallelizing applications to reduce execution…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-11-27 Oussama Tahan

Fissile Locks

Classic test-and-set (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any…

Operating Systems · Computer Science 2020-05-05 Dave Dice, Alex Kogan

Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems

Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-14 Georgios C. Chasparis, Vladimir Janjic, Michael Rossbory

ALock: Asymmetric Lock Primitive for RDMA Systems

Remote direct memory access (RDMA) networks are being rapidly adopted into industry for their high speed, low latency, and reduced CPU overheads compared to traditional kernel-based TCP/IP networks. RDMA enables threads to access remote…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-30 Amanda Baran, Jacob Nelson-Slivon, Lewis Tseng, Roberto Palmieri

DecLock: A Case of Decoupled Locking for Disaggregated Memory

This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-26 Hanze Zhang, Ke Cheng, Rong Chen, Xingda Wei, Haibo Chen

Hemlock: Compact and Scalable Mutual Exclusion

We present Hemlock, a novel mutual exclusion locking algorithm that is extremely compact, requiring just one word per thread plus one word per lock, but which still provides local spinning in most circumstances, high throughput under…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-07 Dave Dice, Alex Kogan

SmartPQ: An Adaptive Concurrent Priority Queue for NUMA Architectures

Concurrent priority queues are widely used in important workloads, such as graph applications and discrete event simulations. However, designing scalable concurrent priority queues for NUMA architectures is challenging. Even though several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-12 Christina Giannoula, Foteini Strati, Dimitrios Siakavaras, Georgios Goumas, Nectarios Koziris

Lotus: Optimizing Disaggregated Transactions with Disaggregated Locks

Disaggregated memory (DM) separates compute and memory resources, allowing flexible scaling to achieve high resource utilization. To ensure atomic and consistent data access on DM, distributed transaction systems have been adapted, where…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-19 Zhisheng Hu, Pengfei Zuo, Junliang Hu, Yizou Chen, Yingjia Wang, Ming-Chang Yang

Avoiding Scalability Collapse by Restricting Concurrency

Saturated locks often degrade the performance of a multithreaded application, leading to a so-called scalability collapse problem. This problem arises when a growing number of threads circulating through a saturated lock causes the overall…

Operating Systems · Computer Science 2019-07-15 Dave Dice, Alex Kogan

Improving the scalability of neutron cross-section lookup codes on multicore NUMA system

We use the XSBench proxy application, a memory-intensive OpenMP program, to explore the source of on-node scalability degradation of a popular Monte Carlo (MC) reactor physics benchmark on non-uniform memory access (NUMA) systems. As…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-10 Kazutomo Yoshii, John Tramm, Andrew Siegel, Pete Beckman

New Thread Migration Strategies for NUMA Systems

Multicore systems present on-board memory hierarchies and communication networks that influence performance when executing shared memory parallel codes. Characterising this influence is complex, and understanding the effect of particular…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-01 O. G. Lorenzo, M. L. Becoña, T. F. Pena, J. C. Cabaleiro, J. A. Lorenzo, F. F. Rivera