Related papers: Early Scheduling in Parallel State Machine Replica…

Optimistic Parallel State-Machine Replication

State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-29 Parisa Jalili Marandi, Fernando Pedone

Rethinking State-Machine Replication for Parallelism

State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-11-26 Parisa Jalili Marandi, Carlos Eduardo Bezerra, Fernando Pedone

Index-Based Scheduling for Parallel State Machine Replication

State Machine Replication (SMR) is a fundamental approach to designing service with fault tolerance. However, its requirement for the deterministic execution of transactions often results in single-threaded replicas, which cannot fully…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-27 Gang Wu, Guodong Zhao, Yidong Song

Efficient Task Replication for Fast Response Times in Parallel Computation

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

Linearizability and State-Machine Replication: Is it a match?

Linearizability is a well-known correctness property for concurrent and distributed systems. In the past, it was also used to prove the design and implementation of replicated state-machines correct. State-machine replication (SMR) is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-03 Franz J. Hauck, Alexander Heß

Scheduling of Hard Real-Time Multi-Thread Periodic Tasks

In this paper we study the scheduling of parallel and real-time recurrent tasks. Firstly, we propose a new parallel task model which allows recurrent tasks to be composed of several threads, each thread requires a single processor for…

Operating Systems · Computer Science 2015-03-19 Irina Iulia Lupu, Joël Goossens

Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud Computing

Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-08 Gutha Jaya Krishna

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun, C. Emre Koksal, Ness B. Shroff

Building State Machine Replication Using Practical Network Synchrony

Distributed systems, such as state machine replication, are critical infrastructures for modern applications. Practical distributed protocols make minimum assumptions about the underlying network: They typically assume a partially…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-18 Yiliang Wan, Nitin Shivaraman, Akshaye Shenoi, Xiang Liu, Tao Luo, Jialin Li

Parallel Combining: Benefits of Explicit Synchronization

Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-14 Vitaly Aksenov, Petr Kuznetsov, Anatoly Shalyto

Scheduling and Trade-off Analysis for Multi-Source Multi-Processor Systems with Divisible Loads

The main goal of parallel processing is to provide users with performance that is much better than that of single processor systems. The execution of jobs is scheduled, which requires certain resources in order to meet certain criteria.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-07 Yang Cao, Fei Wu, Thomas Robertazzi

Parallel machine scheduling with precedence constraints and setup times

This paper presents different methods for solving parallel machine scheduling problems with precedence constraints and setup times between the jobs. Limited discrepancy search methods mixed with local search principles, dominance conditions…

Data Structures and Algorithms · Computer Science 2009-02-19 Bernat Gacias, Christian Artigues, Pierre Lopez

An adaptive robust optimization model for parallel machine scheduling

Real-life parallel machine scheduling problems can be characterized by: (i) limited information about the exact task duration at scheduling time, and (ii) an opportunity to reschedule the remaining tasks each time a task processing is…

Optimization and Control · Mathematics 2023-11-22 Izack Cohen, Krzysztof Postek, Shimrit Shtern

Semi-Federated Scheduling of Parallel Real-Time Tasks on Multiprocessors

Federated scheduling is a promising approach to schedule parallel real-time tasks on multi-cores, where each heavy task exclusively executes on a number of dedicated processors, while light tasks are treated as sequential sporadic tasks and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-10 Xu Jiang, Nan Guan, Xiang Long, Wang Yi

Supporting Parallelism in Server-based Multiprocessor Systems

Developing an efficient server-based real-time scheduling solution that supports dynamic task-level parallelism is now relevant to even the desktop and embedded domains and no longer only to the high performance computing market niche. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-15 Luís Nogueira, Luís Miguel Pinho

Multi-Resource Parallel Query Scheduling and Optimization

Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…

Databases · Computer Science 2014-04-01 Minos Garofalakis, Yannis Ioannidis

Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints

Motivated by modern parallel computing applications, we consider the problem of scheduling parallel-task jobs with heterogeneous resource requirements in a cluster of machines. Each job consists of a set of tasks that can be processed in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-03 Mehrnoosh Shafiee, Javad Ghaderi

Stream-based State-Machine Replication

Developing state-machine replication protocols for practical use is a complex and labor-intensive process because of the myriad of essential tasks (e.g., deployment, communication, recovery) that need to be taken into account in an…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-25 Laura Lawniczak, Tobias Distler

Multitasking Scheduling with Shared Processing

Recently, the problem of multitasking scheduling has attracted a lot of attention in the service industries where workers frequently perform multiple tasks by switching from one task to another. Hall, Leung and Li (Discrete Applied…

Data Structures and Algorithms · Computer Science 2022-04-06 Bin Fu, Yumei Huo, Hairong Zhao

SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Jin Lee, Zhonghao Chen, Xuhang He, Robert Underwood, Bogdan Nicolae, Franck Cappello, Xiaoyi Lu, Sheng Di, Zheng Zhang