Related papers: Efficient Straggler Replication in Large-scale Par…

Efficient Task Replication for Fast Response Times in Parallel Computation

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

Efficient Replication for Straggler Mitigation in Distributed Computing

Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-29 Amir Behrouzi-Far, Emina Soljanin

Straggler Mitigation by Delayed Relaunch of Tasks

Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant…

Performance · Computer Science 2017-10-03 Mehmet Fatih Aktas, Pei Peng, Emina Soljanin

Effective Straggler Mitigation: Which Clones Should Attack and When?

Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant…

Performance · Computer Science 2017-10-03 Mehmet Fatih Aktas, Pei Peng, Emina Soljanin

Efficient Replication of Queued Tasks for Latency Reduction in Cloud Systems

In cloud computing systems, assigning a job to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers. Although adding redundant replicas always…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-21 Gauri Joshi, Emina Soljanin, Gregory Wornell

Straggler Mitigation at Scale

Runtime performance variability at the servers has been a major issue, hindering the predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be…

Performance · Computer Science 2019-10-10 Mehmet Fatih Aktas, Emina Soljanin

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun, C. Emre Koksal, Ness B. Shroff

Exploiting Stragglers in Distributed Computing Systems with Task Grouping

We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-07 Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper

Diversity/Parallelism Trade-off in Distributed Systems with Redundancy

As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-21 Pei Peng, Emina Soljanin, Philip Whiting

Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation

Runtime variability in computing systems causes some tasks to straggle and take much longer than expected to complete. These straggler tasks are known to significantly slow down distributed computation. Job execution with speculative…

Performance · Computer Science 2019-06-14 Mehmet Fatih Aktas, Emina Soljanin

Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network

Straggler task detection is one of the main challenges in applying MapReduce for parallelizing and distributing large-scale data processing. It is defined as detecting running tasks on weak nodes. Considering two stages in the Map phase…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-14 Amir Javadpour, Guojun Wang, Samira Rezaei, Kuan Ching Li

Task-Cloning Algorithms in a MapReduce Cluster with Competitive Performance Bounds

Job scheduling for a MapReduce cluster has been an active research topic in recent years. However, measurement traces from real-world production environments show that the durations of tasks within a job vary widely. The overall elapsed time…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-13 Huanle Xu, Wing Cheong Lau

Optimization for Speculative Execution of Multiple Jobs in a MapReduce-like Cluster

Nowadays, a computing cluster in a typical data center can easily consist of hundreds of thousands of commodity servers, making component/machine failures the norm rather than the exception. A parallel processing job can be delayed…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-06 Huanle Xu, Wing Cheong Lau

Data Replication for Reducing Computing Time in Distributed Systems with Stragglers

In distributed computing systems with stragglers, various forms of redundancy can improve the average delay performance. We study the optimal replication of data in systems where the job execution time is a stochastically decreasing and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-01 Amir Behrouzi-Far, Emina Soljanin

On the Effect of Task-to-Worker Assignment in Distributed Computing Systems with Stragglers

We study the expected completion time of some recently proposed algorithms for distributed computing which redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We analytically…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-10 Amir Behrouzi-Far, Emina Soljanin

Efficient Redundancy Techniques for Latency Reduction in Cloud Systems

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers, and reduce latency. But adding redundancy…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-13 Gauri Joshi, Emina Soljanin, Gregory Wornell

Tuning the Tail Latency of Distributed Queries Using Replication

Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple…

Databases · Computer Science 2022-12-21 Nathan Ng, Hung Le, Marco Serafini

Replication in Graph Partitioning and Scheduling Problems

The efficient parallel execution of complex computations requires balancing the workload across processors while minimizing the communication between them. This inherent trade-off is often captured by graph partitioning or DAG scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-04 Pál András Papp, Toni Böhnlein, A. N. Yzelman

Straggler Mitigation in Distributed Optimization Through Data Encoding

Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in…

Machine Learning · Statistics 2018-01-24 Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin

Hierarchical Coded Matrix Multiplication

Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…

Information Theory · Computer Science 2019-07-23 Shahrzad Kiani, Nuwan Ferdinand, Stark C. Draper