Related papers: Early Scheduling in Parallel State Machine Replica…
State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses…
State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the…
State Machine Replication (SMR) is a fundamental approach to designing service with fault tolerance. However, its requirement for the deterministic execution of transactions often results in single-threaded replicas, which cannot fully…
One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…
Linearizability is a well-known correctness property for concurrent and distributed systems. In the past, it was also used to prove the design and implementation of replicated state-machines correct. State-machine replication (SMR) is a…
In this paper we study the scheduling of parallel and real-time recurrent tasks. Firstly, we propose a new parallel task model which allows recurrent tasks to be composed of several threads, each thread requires a single processor for…
Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…
In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…
Distributed systems, such as state machine replication, are critical infrastructures for modern applications. Practical distributed protocols make minimum assumptions about the underlying network: They typically assume a partially…
Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…
The main goal of parallel processing is to provide users with performance that is much better than that of single processor systems. The execution of jobs is scheduled, which requires certain resources in order to meet certain criteria.…
This paper presents different methods for solving parallel machine scheduling problems with precedence constraints and setup times between the jobs. Limited discrepancy search methods mixed with local search principles, dominance conditions…
Real-life parallel machine scheduling problems can be characterized by: (i) limited information about the exact task duration at scheduling time, and (ii) an opportunity to reschedule the remaining tasks each time a task processing is…
Federated scheduling is a promising approach to schedule parallel real-time tasks on multi-cores, where each heavy task exclusively executes on a number of dedicated processors, while light tasks are treated as sequential sporadic tasks and…
Developing an efficient server-based real-time scheduling solution that supports dynamic task-level parallelism is now relevant to even the desktop and embedded domains and no longer only to the high performance computing market niche. This…
Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…
Motivated by modern parallel computing applications, we consider the problem of scheduling parallel-task jobs with heterogeneous resource requirements in a cluster of machines. Each job consists of a set of tasks that can be processed in…
Developing state-machine replication protocols for practical use is a complex and labor-intensive process because of the myriad of essential tasks (e.g., deployment, communication, recovery) that need to be taken into account in an…
Recently, the problem of multitasking scheduling has attracted a lot of attention in the service industries where workers frequently perform multiple tasks by switching from one task to another. Hall, Leung and Li (Discrete Applied…
In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for…