Related papers: Timely-Throughput Optimal Coded Computing over Clo…
Large scale clusters leveraging distributed computing frameworks such as MapReduce routinely process data that are on the orders of petabytes or more. The sheer size of the data precludes the processing of the data on a single computer. The…
In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler…
In distributed computing systems slow working nodes, known as stragglers, can greatly extend finishing times. Coded computing is a technique that enables straggler-resistant computation. Most coded computing techniques presented to date…
While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce the performance and create bottlenecks like stragglers. Coded computation techniques…
Elasticity is one important feature in modern cloud computing systems and can result in computation failure or significantly increase computing time. Such elasticity means that virtual machines over the cloud can be preempted under a short…
Elasticity is offered by cloud service providers to exploit under-utilized computing resources. The low-cost elastic nodes can leave and join any time during the computation cycle. The possibility of elastic events occurring together with…
Slow working nodes, known as stragglers, can greatly reduce the speed of distributed computation. Coded matrix multiplication is a recently introduced technique that enables straggler-resistant distributed multiplication of large matrices.…
Mobile-edge computing (MEC) emerges as a promising paradigm to improve the quality of computation experience for mobile devices. Nevertheless, the design of computation task scheduling policies for MEC systems inevitably encounters a…
We propose a unified coded framework for distributed computing with straggling servers, by introducing a tradeoff between "latency of computation" and "load of communication" for some linear computation tasks. We show that the coded scheme…
Coded computing has emerged as a promising framework for tackling significant challenges in large-scale distributed computing, including the presence of slow, faulty, or compromised servers. In this approach, each worker node processes a…
This paper considers a Markov decision model for profit maximization of a cloud computing service provider catering to customers submitting jobs with firm real-time random deadlines. Customers are charged on a per-job basis, receiving a…
Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…
Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems. Coded computing is a recently emerging paradigm where redundancy…
Motivated by the increasing importance of providing delay-guaranteed services in general computing and communication systems, and the recent wide adoption of learning and prediction in network control, in this work, we consider a general…
Recent literature including our past work provide analysis and solutions for using (i) erasure coding, (ii) parallelism, or (iii) variable slicing/chunking (i.e., dividing an object of a specific size into a variable number of smaller…
We study scheduling of computation tasks across n workers in a large scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to…
Large-scale distributed computing systems face two major bottlenecks that limit their scalability: straggler delay caused by the variability of computation times at different worker nodes and communication bottlenecks caused by shuffling…
Today's data centers have an abundance of computing resources, hosting server clusters consisting of as many as tens or hundreds of thousands of machines. To execute a complex computing task over a data center, it is natural to distribute…
To improve the utility of learning applications and render machine learning solutions feasible for complex applications, a substantial amount of heavy computations is needed. Thus, it is essential to delegate the computations among several…
Consider a multi-cell mobile edge computing network, in which each user wishes to compute the product of a user-generated data matrix with a network-stored matrix. This is done through task offloading by means of input uploading,…