Related papers: Tracing Distributed Algorithms Using Replay Clocks

Replay Clocks

In this work, we focus on the problem of replay clocks (RepCL). The need for replay clocks arises from the observation that analyzing distributed computation for all desired properties of interest may not be feasible in an online…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-15 Ishaan Lagwankar, Sandeep S Kulkarni

Efficient Task Replication for Fast Response Times in Parallel Computation

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

Causality Diagrams using Hybrid Vector Clocks

Causality in distributed systems is a concept that has long been explored and numerous approaches have been made to use causality as a way to trace distributed system execution. Traditional approaches usually used system profiling and newer…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-14 Ishaan Lagwankar, Kanishka Wijewardena

Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures

We present Cyclotron, a framework and compiler for using recurrence equations to express streaming dataflow algorithms, which then get portably compiled to distributed topologies of interlinked processors. Our framework provides an input…

Programming Languages · Computer Science 2025-11-14 Shiv Sundram, Akhilesh Balasingam, Nathan Zhang, Kunle Olukotun, Fredrik Kjolstad

DPCP-p: A Distributed Locking Protocol for Parallel Real-Time Tasks

Real-time scheduling and locking protocols are fundamental facilities to construct time-critical systems. For parallel real-time tasks, predictable locking protocols are required when concurrent sub-jobs mutually exclusive access to shared…

Operating Systems · Computer Science 2020-07-03 Maolin Yang, Zewei Chen, Xu Jiang, Nan Guan, Hang Lei

Analysis of Workflow Schedulers in Simulated Distributed Environments

Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed on both HPC clusters and in the cloud. An important aspect of executing such graphs is the used scheduling algorithm.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-18 Jakub Beránek, Stanislav Böhm, Vojtěch Cima

Tuning the Tail Latency of Distributed Queries Using Replication

Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple…

Databases · Computer Science 2022-12-21 Nathan Ng, Hung Le, Marco Serafini

On Performance Debugging of Unnecessary Lock Contentions on Multicore Processors: A Replay-based Approach

Locks have been widely used as an effective synchronization mechanism among processes and threads. However, we observe that a large number of false inter-thread dependencies (i.e., unnecessary lock contentions) exist during the program…

Programming Languages · Computer Science 2015-04-22 Long Zheng, Xiaofei Liao, Bingsheng He, Song Wu, Hai Jin

Efficient Straggler Replication in Large-scale Parallel Computing

In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion. Computing frameworks such as MapReduce and Spark tackle this by replicating the straggling…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-14 Da Wang, Gauri Joshi, Gregory Wornell

Replication in Graph Partitioning and Scheduling Problems

The efficient parallel execution of complex computations requires balancing the workload across processors while minimizing the communication between them. This inherent trade-off is often captured by graph partitioning or DAG scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-04 Pál András Papp, Toni Böhnlein, A. N. Yzelman

Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs

After all these years and all these other shared memory programming frameworks, OpenMP is still the most popular one. However, its greater levels of non-deterministic execution make debugging and testing more challenging. The ability to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-19 Xiang Fu, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz

Efficiently Scheduling Parallel DAG Tasks on Identical Multiprocessors

Parallel real-time embedded applications can be modelled as directed acyclic graphs (DAGs) whose nodes model subtasks and whose edges model precedence constraints among subtasks. Efficiently scheduling such parallel tasks can be challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-24 Shardul Lendve, Konstantinos Bletsas, Pedro F. Souto

Online Continual Learning on Sequences

Online continual learning (OCL) refers to the ability of a system to learn over time from a continuous stream of data without having to revisit previously encountered training samples. Learning continually in a single data pass is crucial…

Machine Learning · Computer Science 2020-03-23 German I. Parisi, Vincenzo Lomonaco

Graph-based Algorithms for Linear Computation Coding

We revisit existing linear computation coding (LCC) algorithms, and introduce a new framework that measures the computational cost of computing multidimensional linear functions, not only in terms of the number of additions, but also with…

Information Theory · Computer Science 2024-01-17 Hans Rosenberger, Ali Bereyhi, Ralf R. Müller

Reinforcement Learning in Computing and Network Convergence Orchestration

As computing power is becoming the core productivity of the digital economy era, the concept of Computing and Network Convergence (CNC), under which network and computing resources can be dynamically scheduled and allocated according to…

Networking and Internet Architecture · Computer Science 2022-09-23 Aidong Yang, Mohan Wu, Boquan Cheng, Xiaozhou Ye, Ye Ouyang

Learn the Time to Learn: Replay Scheduling in Continual Learning

Replay methods are known to be successful at mitigating catastrophic forgetting in continual learning scenarios despite having limited access to historical data. However, storing historical data is cheap in many real-world settings, yet…

Machine Learning · Computer Science 2023-11-22 Marcus Klasson, Hedvig Kjellström, Cheng Zhang

RepFlow: Minimizing Flow Completion Times with Replicated Flows in Data Centers

Short TCP flows that are critical for many interactive applications in data centers are plagued by large flows and head-of-line blocking in switches. Hash-based load balancing schemes such as ECMP aggravate the matter and result in…

Networking and Internet Architecture · Computer Science 2016-11-18 Hong Xu, Baochun Li

Temporal Computer Organization

This document is focused on computing systems implemented in technologies that communicate and compute with temporal transients. Although described in general terms, implementations of spiking neural networks are of primary interest. As…

Neural and Evolutionary Computing · Computer Science 2022-01-20 James E. Smith

Efficient Multidimensional Data Redistribution for Resizable Parallel Computations

Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-06-15 Rajesh Sudarsan, Calvin J. Ribbens

Online Distributed Scheduling on a Fault-prone Parallel System

We consider a parallel system of $m$ identical machines prone to unpredictable crashes and restarts, trying to cope with the continuous arrival of tasks to be executed. Tasks have different computational requirements (i.e., processing time…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-21 Elli Zavou, Antonio Fernández Anta