Related papers: Data Replication for Reducing Computing Time in Di…

Efficient Replication for Straggler Mitigation in Distributed Computing

Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-29 Amir Behrouzi-Far , Emina Soljanin

On the Effect of Task-to-Worker Assignment in Distributed Computing Systems with Stragglers

We study the expected completion time of some recently proposed algorithms for distributed computing which redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We analytically…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-10 Amir Behrouzi-Far , Emina Soljanin

Diversity/Parallelism Trade-off in Distributed Systems with Redundancy

As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-21 Pei Peng , Emina Soljanin , Philip Whiting

Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation

Runtime variability in computing systems causes some tasks to straggle and take much longer than expected to complete. These straggler tasks are known to significantly slowdown distributed computation. Job execution with speculative…

Performance · Computer Science 2019-06-14 Mehmet Fatih Aktas , Emina Soljanin

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun , C. Emre Koksal , Ness B. Shroff

Exploiting Stragglers in Distributed Computing Systems with Task Grouping

We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-07 Tharindu Adikari , Haider Al-Lawati , Jason Lam , Zhenhua Hu , Stark C. Draper

Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load

In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-19 Maximilian Egger , Serge Kas Hanna , Rawad Bitar

Efficient Task Replication for Fast Response Times in Parallel Computation

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang , Gauri Joshi , Gregory Wornell

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is…

Machine Learning · Statistics 2018-03-15 Can Karakus , Yifan Sun , Suhas Diggavi , Wotao Yin

Straggler Mitigation in Distributed Optimization Through Data Encoding

Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in…

Machine Learning · Statistics 2018-01-24 Can Karakus , Yifan Sun , Suhas Diggavi , Wotao Yin

Straggler Mitigation by Delayed Relaunch of Tasks

Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant…

Performance · Computer Science 2017-10-03 Mehmet Fatih Aktas , Pei Peng , Emina Soljanin

Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers

With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each…

Machine Learning · Computer Science 2021-02-15 Guojun Xiong , Gang Yan , Rahul Singh , Jian Li

Coded Computation across Shared Heterogeneous Workers with Communication Delay

Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-20 Yuxuan Sun , Fan Zhang , Junlin Zhao , Sheng Zhou , Zhisheng Niu , Deniz Gündüz

Straggler Mitigation at Scale

Runtime performance variability at the servers has been a major issue, hindering the predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be…

Performance · Computer Science 2019-10-10 Mehmet Fatih Aktas , Emina Soljanin

Computation Scheduling for Distributed Machine Learning with Straggling Workers

We study scheduling of computation tasks across n workers in a large scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-08 Mohammad Mohammadi Amiri , Deniz Gunduz

Efficient Straggler Replication in Large-scale Parallel Computing

In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion. Computing frameworks such as MapReduce and Spark tackle this by replicating the straggling…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-14 Da Wang , Gauri Joshi , Gregory Wornell

Effective Straggler Mitigation: Which Clones Should Attack and When?

Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant…

Performance · Computer Science 2017-10-03 Mehmet Fatih Aktas , Pei Peng , Emina Soljanin

Stream Distributed Coded Computing

The emerging large-scale and data-hungry algorithms require the computations to be delegated from a central server to several worker nodes. One major challenge in the distributed computations is to tackle delays and failures caused by the…

Information Theory · Computer Science 2021-03-03 Alejandro Cohen , Guillaume Thiran , Homa Esfahanizadeh , Muriel Médard

Balanced Nonadaptive Redundancy Scheduling

Distributed computing systems implement redundancy to reduce the job completion time and variability. Despite a large body of work about computing redundancy, the analytical performance evaluation of redundancy techniques in queuing systems…

Information Theory · Computer Science 2022-01-05 Amir Behrouzi-Far , Emina Soljanin

Optimal Server Selection for Straggler Mitigation

The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional…

Networking and Internet Architecture · Computer Science 2026-01-01 Ajay Badita , Parimal Parag , Vaneet Aggarwal