Related papers: Efficient Task Replication for Fast Response Times…

Efficient Straggler Replication in Large-scale Parallel Computing

In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion. Computing frameworks such as MapReduce and Spark tackle this by replicating the straggling…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-09-14 · Da Wang, Gauri Joshi, Gregory Wornell

Diversity/Parallelism Trade-off in Distributed Systems with Redundancy

As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-12-21 · Pei Peng, Emina Soljanin, Philip Whiting

Data Placement and Replica Selection for Improving Co-location in Distributed Environments

Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative…

Databases · Computer Science · 2013-02-19 · K. Ashwin Kumar, Amol Deshpande, Samir Khuller

Efficient Replication for Straggler Mitigation in Distributed Computing

Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-12-29 · Amir Behrouzi-Far, Emina Soljanin

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science · 2017-02-08 · Yin Sun, C. Emre Koksal, Ness B. Shroff

Efficient Replication of Queued Tasks for Latency Reduction in Cloud Systems

In cloud computing systems, assigning a job to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers. Although adding redundant replicas always…

Distributed, Parallel, and Cluster Computing · Computer Science · 2015-10-21 · Gauri Joshi, Emina Soljanin, Gregory Wornell

Data Replication for Reducing Computing Time in Distributed Systems with Stragglers

In distributed computing systems with stragglers, various forms of redundancy can improve the average delay performance. We study the optimal replication of data in systems where the job execution time is a stochastically decreasing and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-01-01 · Amir Behrouzi-Far, Emina Soljanin

Contrasting Effects of Replication in Parallel Systems: From Overload to Underload and Back

Task replication has recently been advocated as a practical solution to reduce latencies in parallel systems. In addition to several convincing empirical studies, some others provide analytical results, yet under some strong assumptions…

Performance · Computer Science · 2016-02-26 · Felix Poloczek, Florin Ciucu

Efficiently Scheduling Parallel DAG Tasks on Identical Multiprocessors

Parallel real-time embedded applications can be modelled as directed acyclic graphs (DAGs) whose nodes model subtasks and whose edges model precedence constraints among subtasks. Efficiently scheduling such parallel tasks can be challenging…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-24 · Shardul Lendve, Konstantinos Bletsas, Pedro F. Souto

Analysis of Reinforcement Learning for determining task replication in workflows

Executing workflows on volunteer computing resources where individual tasks may be forced to relinquish their resource for the resource's primary use leads to unpredictability and often significantly increases execution time. Task…

Performance · Computer Science · 2022-09-28 · Andrew Stephen McGough, Matthew Forshaw

Early Scheduling in Parallel State Machine Replication

State machine replication is a standard approach to fault tolerance. One of the key assumptions of state machine replication is that replicas must execute operations deterministically and thus serially. To benefit from multi-core servers,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-05-15 · Eduardo Alchieri, Fernando Dotti, Fernando Pedone

Replication in Graph Partitioning and Scheduling Problems

The efficient parallel execution of complex computations requires balancing the workload across processors while minimizing the communication between them. This inherent trade-off is often captured by graph partitioning or DAG scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-04 · Pál András Papp, Toni Böhnlein, A. N. Yzelman

ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning

Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies using…

Machine Learning · Computer Science · 2025-05-23 · Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, Francesco Orabona

Tuning the Tail Latency of Distributed Queries Using Replication

Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple…

Databases · Computer Science · 2022-12-21 · Nathan Ng, Hung Le, Marco Serafini

Online Distributed Scheduling on a Fault-prone Parallel System

We consider a parallel system of $m$ identical machines prone to unpredictable crashes and restarts, trying to cope with the continuous arrival of tasks to be executed. Tasks have different computational requirements (i.e., processing time…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-03-21 · Elli Zavou, Antonio Fernández Anta

Exploiting Stragglers in Distributed Computing Systems with Task Grouping

We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-07 · Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper

Scheduling and Trade-off Analysis for Multi-Source Multi-Processor Systems with Divisible Loads

The main goal of parallel processing is to provide users with performance that is much better than that of single processor systems. The execution of jobs is scheduled, which requires certain resources in order to meet certain criteria.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-02-07 · Yang Cao, Fei Wu, Thomas Robertazzi

Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints

Motivated by modern parallel computing applications, we consider the problem of scheduling parallel-task jobs with heterogeneous resource requirements in a cluster of machines. Each job consists of a set of tasks that can be processed in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-04-03 · Mehrnoosh Shafiee, Javad Ghaderi

Resource allocation for task-level speculative scientific applications: a proof of concept using Parallel Trajectory Splicing

The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy to improve scalability is to express the algorithm in terms of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-10-23 · Andrew Garmon, Vinay Ramakrishnaiah, Danny Perez

A Comparative Study of Replication Techniques in Grid Computing Systems

Grid Computing is a type of parallel and distributed system that is designed to provide reliable access to data and computational resources in wide area networks. These resources are distributed in different geographical locations, however…

Distributed, Parallel, and Cluster Computing · Computer Science · 2013-09-27 · Sheida Dayyani, Mohammad Reza Khayyambashi