Related papers: Efficient Straggler Replication in Large-scale Par…

200 papers

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-29 Amir Behrouzi-Far, Emina Soljanin

Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant…

Performance · Computer Science 2017-10-03 Mehmet Fatih Aktas, Pei Peng, Emina Soljanin

In cloud computing systems, assigning a job to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers. Although adding redundant replicas always…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-21 Gauri Joshi, Emina Soljanin, Gregory Wornell

Runtime performance variability at the servers has been a major issue, hindering the predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be…

Performance · Computer Science 2019-10-10 Mehmet Fatih Aktas, Emina Soljanin

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun, C. Emre Koksal, Ness B. Shroff

We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-07 Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper

As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-21 Pei Peng, Emina Soljanin, Philip Whiting

Runtime variability in computing systems causes some tasks to straggle and take much longer than expected to complete. These straggler tasks are known to significantly slow down distributed computation. Job execution with speculative…

Performance · Computer Science 2019-06-14 Mehmet Fatih Aktas, Emina Soljanin

Straggler task detection is one of the main challenges in applying MapReduce for parallelizing and distributing large-scale data processing. It is defined as detecting running tasks on weak nodes. Considering two stages in the Map phase…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-14 Amir Javadpour, Guojun Wang, Samira Rezaei, Kuan Ching Li

Job scheduling for a MapReduce cluster has been an active research topic in recent years. However, measurement traces from real-world production environments show that the durations of tasks within a job vary widely. The overall elapsed time…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-13 Huanle Xu, Wing Cheong Lau

Nowadays, a computing cluster in a typical data center can easily consist of hundreds of thousands of commodity servers, making component/machine failures the norm rather than the exception. A parallel processing job can be delayed…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-06 Huanle Xu, Wing Cheong Lau

In distributed computing systems with stragglers, various forms of redundancy can improve the average delay performance. We study the optimal replication of data in systems where the job execution time is a stochastically decreasing and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-01 Amir Behrouzi-Far, Emina Soljanin

We study the expected completion time of some recently proposed algorithms for distributed computing which redundantly assign computing tasks to multiple machines in order to tolerate a certain number of machine failures. We analytically…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-10 Amir Behrouzi-Far, Emina Soljanin

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers, and reduce latency. But adding redundancy…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-13 Gauri Joshi, Emina Soljanin, Gregory Wornell

Querying graph data with low latency is an important requirement in application domains such as social networks and knowledge graphs. Graph queries perform multiple hops between vertices. When data is partitioned and stored across multiple…

Databases · Computer Science 2022-12-21 Nathan Ng, Hung Le, Marco Serafini

The efficient parallel execution of complex computations requires balancing the workload across processors while minimizing the communication between them. This inherent trade-off is often captured by graph partitioning or DAG scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-04 Pál András Papp, Toni Böhnlein, A. N. Yzelman

Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in…

Machine Learning · Statistics 2018-01-24 Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin

Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…

Information Theory · Computer Science 2019-07-23 Shahrzad Kiani, Nuwan Ferdinand, Stark C. Draper