Related papers: High-Performance Distributed RMA Locks
Designers of modern reader-writer locks confront a difficult trade-off related to reader scalability. Locks that have a compact memory representation for active readers will typically suffer under high intensity read-dominated workloads…
In this paper, we consider a hierarchical distributed multi-task learning (MTL) system where distributed users wish to jointly learn different models orchestrated by a central server with the help of a layer of multiple relays. Since the…
Despite the increasing adoption of Field-Programmable Gate Arrays (FPGAs) in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to…
Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from…
The relaxed semantics and rich functionality of one-sided communication primitives of MPI-3 makes MPI an attractive candidate for the implementation of PGAS models. However, the performance of such implementation suffers from the fact, that…
Distributed locking mechanisms are fundamental to ensuring data consistency and integrity in distributed systems. This paper presents a comprehensive analysis of distributed locking algorithms, focusing on their performance characteristics…
Designing a rate limiter that is simultaneously accurate, available, and scalable presents a fundamental challenge in distributed systems, primarily due to the trade-offs between algorithmic precision, availability, consistency, and…
Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and…
Memory disaggregation architecture physically separates CPU and memory into independent components, which are connected via high-speed RDMA networks, greatly improving resource utilization of databases. However, such an architecture poses…
Surrogate models can play a pivotal role in enhancing performance in contemporary High-Performance Computing applications. Cache-based surrogates use already calculated simulation results to interpolate or extrapolate further simulation…
In this work, we aim to evaluate different Distributed Lock Management service designs with Remote Direct Memory Access (RDMA). In specific, we implement and evaluate the centralized and the RDMA-enabled lock manager designs for fast…
A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms. Hence, improving its performance can have large-scale consequences. Since the seminal work of ABD…
In this paper, a novel distributed scheduling algorithm is proposed, which aims to efficiently schedule both the uplink and downlink backhaul traffic in the relay-assisted mmWave backhaul network with a tree topology. The handshaking of…
Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach is clearly not prone to scale with large thread…
Distributed resource allocation (DRA) is fundamental to modern networked systems, spanning applications from economic dispatch in smart grids to CPU scheduling in data centers. Conventional DRA approaches require reliable communication, yet…
Remote memory access (RMA) is an emerging high-performance programming model that uses RDMA hardware directly. Yet, accessing remote memories cannot invoke activities at the target which complicates implementation and limits performance of…
Fully provisioned Message Passing Interface (MPI) parallelism achieves near-optimal wall-clock time for Computational Fluid Dynamics (CFD) solvers. This work addresses a complementary question for shared, cloud-managed clusters: can…
The memory capacity in edge devices is often limited due to constraints on cost, size, and power. Consequently, memory competition leads to inevitable page swapping in memory-constrained mixed-criticality edge devices, causing slow storage…
Efficient multi-robot task allocation (MRTA) is fundamental to various time-sensitive applications such as disaster response, warehouse operations, and construction. This paper tackles a particular class of these problems that we call…
Efficient task scheduling in large-scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality-of-service requirements. Traditional centralized approaches face…