Related papers: Speeding Up Distributed Machine Learning Using Cod…
We study the problem of computing matrix chain multiplications in a distributed computing cluster. In such systems, performance is often limited by the straggler problem, where the slowest worker dominates the overall computation latency.…
Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade the…
Data shuffling between distributed cluster of nodes is one of the critical steps in implementing large-scale learning algorithms. Randomly shuffling the data-set among a cluster of workers allows different nodes to obtain fresh data…
Coded computing is an effective technique to mitigate "stragglers" in large-scale and distributed matrix multiplication. In particular, univariate polynomial codes have been shown to be effective in straggler mitigation by making the…
Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…
Straggler nodes are well-known bottlenecks of distributed matrix computations which induce reductions in computation/communication speeds. A common strategy for mitigating such stragglers is to incorporate Reed-Solomon based MDS (maximum…
In a distributed computing system operating according to the map-shuffle-reduce framework, coding data prior to storage can be useful both to reduce the latency caused by straggling servers and to decrease the inter-server communication…
Large-scale distributed computing systems face two major bottlenecks that limit their scalability: straggler delay caused by the variability of computation times at different worker nodes and communication bottlenecks caused by shuffling…
This paper aims to mitigate straggler effects in synchronous distributed learning for multi-agent reinforcement learning (MARL) problems. Stragglers arise frequently in a distributed learning system, due to the existence of various system…
Distributed computation is a framework used to break down a complex computational task into smaller tasks and distributing them among computational nodes. Erasure correction codes have recently been introduced and have become a popular…
In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler…
We consider the data shuffling problem in a distributed learning system, in which a master node is connected to a set of worker nodes, via a shared link, in order to communicate a set of files to the worker nodes. The master node has access…
We consider the problem of private distributed matrix multiplication under limited resources. Coded computation has been shown to be an effective solution in distributed matrix multiplication, both providing privacy against the workers and…
In cloud computing systems slow processing nodes, often referred to as "stragglers", can significantly extend the computation time. Recent results have shown that error correction coding can be used to reduce the effect of stragglers. In…
Matrix multiplication is a fundamental building block for large scale computations arising in various applications, including machine learning. There has been significant recent interest in using coding to speed up distributed matrix…
Distributed matrix computations -- matrix-matrix or matrix-vector multiplications -- are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of prior work in this area is (i) either sub-optimal in…
In distributed computing systems slow working nodes, known as stragglers, can greatly extend finishing times. Coded computing is a technique that enables straggler-resistant computation. Most coded computing techniques presented to date…
Matrix computations are a fundamental building-block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations…
We consider the problem of massive matrix multiplication, which underlies many data analytic applications, in a large-scale distributed system comprising a group of worker nodes. We target the stragglers' delay performance bottleneck, which…
Distributed multi-task learning (DMTL) effectively improves model generalization performance through the collaborative training of multiple related models. However, in large-scale learning scenarios, communication bottlenecks severely limit…