Related papers: Optimal Load Allocation for Coded Distributed Comp…
Coded distributed computing framework enables large-scale machine learning (ML) models to be trained efficiently in a distributed manner, while mitigating the straggler effect. In this work, we consider a multi-task assignment problem in a…
Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade the…
The emerging large-scale and data-hungry algorithms require the computations to be delegated from a central server to several worker nodes. One major challenge in the distributed computations is to tackle delays and failures caused by the…
To improve the utility of learning applications and render machine learning solutions feasible for complex applications, a substantial amount of heavy computations is needed. Thus, it is essential to delegate the computations among several…
We consider the recently proposed Coded Distributed Computing (CDC) framework that leverages carefully designed redundant computations to enable coding opportunities that substantially reduce the communication load of distributed computing.…
Gradient descent algorithms are widely used in machine learning. In order to deal with huge volume of data, we consider the implementation of gradient descent algorithms in a distributed computing setting where multiple workers compute the…
In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler…
In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the…
The problem of minimizing mean response time of generic jobs submitted to a heterogenous distributed computer systems is considered in this paper. A static load balancing strategy, in which decision of redistribution of loads does not…
We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the worker such…
Today's data centers have an abundance of computing resources, hosting server clusters consisting of as many as tens or hundreds of thousands of machines. To execute a complex computing task over a data center, it is natural to distribute…
Large-scale distributed computing systems face two major bottlenecks that limit their scalability: straggler delay caused by the variability of computation times at different worker nodes and communication bottlenecks caused by shuffling…
Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability of running such tasks over a large number of commodity nodes in a time efficient manner. One of…
Coding for distributed computing supports low-latency computation by relieving the burden of straggling workers. While most existing works assume a simple master-worker model, we consider a hierarchical computational structure consisting of…
Edge computing has recently emerged as a promising paradigm to boost the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, the introduction of edge nodes adds an additional…
We propose a unified coded framework for distributed computing with straggling servers, by introducing a tradeoff between "latency of computation" and "load of communication" for some linear computation tasks. We show that the coded scheme…
Streaming applications frequently encounter skewed workloads and execute on heterogeneous clusters. Optimal resource utilization in such adverse conditions becomes a challenge, as it requires inferring the resource capacities and input…
Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…
Coded distributed computing has been considered as a promising technique which makes large-scale systems robust to the "straggler" workers. Yet, practical system models for distributed computing have not been available that reflect the…
In this paper, we propose an optimally structured gradient coding scheme to mitigate the straggler problem in distributed learning. Conventional gradient coding methods often assume homogeneous straggler models or rely on excessive data…