Related papers: Multi-Grained Specifications for Distributed Syste…
ZooKeeper is a coordination service, widely used as a backbone of various distributed systems. Though its reliability is of critical importance, testing is insufficient for an industrial-strength system of the size and complexity of…
TLA+ is a formal language for specifying systems, including distributed algorithms, that is supported by powerful verification tools. In this work we present a framework for relating traces of distributed programs to high-level…
Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a…
Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on monolithic, single-tier storage…
The trade-off between coarse- and fine-grained locking is a well understood issue in operating systems. Coarse-grained locking provides lower overhead under low contention, fine-grained locking provides higher scalability under contention,…
Using an algorithm due to Safra for distributed termination detection as a running example, we present the main tools for verifying specifications written in TLA+. Examining their complementary strengths and weaknesses, we suggest a…
Verifying specifications for large-scale modern engineering systems can be a time-consuming task, as most formal verification methods are limited to systems of modest size. Recently, contract-based design and verification has been proposed…
Formal modelling is a powerful tool for developing complex systems. At MongoDB, we use TLA+ to model and verify multiple aspects of several systems. Ensuring conformance between a specification and its implementation can add value to any…
Geo-replicated systems provide a number of desirable properties such as globally low latency, high availability, scalability, and built-in fault tolerance. Unfortunately, programming correct applications on top of such systems has proven to…
Labeling a classification dataset implies to define classes and associated coarse labels, that may approximate a smoother and more complicated ground truth. For example, natural images may contain multiple objects, only one of which is…
TLA+ is a formal specification language used for designing, modeling, documenting, and verifying systems through model checking. Despite significant interest from the research community, knowledge about usage of the TLA+ ecosystem in…
The last decade has sparked several valiant efforts in deductive verification of distributed agreement protocols such as consensus and leader election. Oddly, there have been far fewer verification efforts that go beyond the core protocols…
Verifying multi-threaded programs is becoming more and more important, because of the strong trend to increase the number of processing units per CPU socket. We introduce a new configurable program analysis for verifying multi-threaded…
Proving correctness of distributed or concurrent algorithms is a mind-challenging and complex process. Slight errors in the reasoning are difficult to find, calling for computer-checked proof systems. In order to build computer-checked…
Ensuring that safety-critical applications behave as intended is an important yet challenging task. Modeling languages like differential dynamic logic (dL) have proof calculi capable of proving guarantees for such applications. However, dL…
Distributed machine learning training and inference is common today because today's large models require more memory and compute than can be provided by a single GPU. Distributed models are generally produced by programmers who take a…
Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of…
Concurrent systems are notoriously difficult to validate: subtle bugs may only manifest under rare thread interleavings, and existing tools often require intrusive instrumentation or unrealistic execution models. We present OmniLink, a new…
Existing text classification methods mainly focus on a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label increases. To accommodate such requirements, we…
Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios.…