Related papers: Adaptive Gradient Coding
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous…
This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff to compute gradients (and more generally vector summations) in terms of three parameters: computation…
We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the worker such…
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called 'stragglers', and…
Distributed implementations are crucial in speeding up large scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A…
In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation…
Gradient descent algorithms are widely used in machine learning. In order to deal with huge volume of data, we consider the implementation of gradient descent algorithms in a distributed computing setting where multiple workers compute the…
Gradient-based distributed learning in Parameter Server (PS) computing architectures is subject to random delays due to straggling worker nodes, as well as to possible communication bottlenecks between PS and workers. Solutions have been…
Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes…
In distributed synchronous gradient descent (GD) the main performance bottleneck for the per-iteration completion time is the slowest \textit{straggling} workers. To speed up GD iterations in the presence of stragglers, coded distributed…
In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the…
We consider distributed gradient descent in the presence of stragglers. Recent work on \em gradient coding \em and \em approximate gradient coding \em have shown how to add redundancy in distributed gradient descent to guarantee convergence…
In this paper, we propose an optimally structured gradient coding scheme to mitigate the straggler problem in distributed learning. Conventional gradient coding methods often assume homogeneous straggler models or rely on excessive data…
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, suffer from slow running machines, called 'stragglers'. Gradient coding is a coding-theoretic framework to…
Gradient coding is a technique for straggler mitigation in distributed learning. In this paper we design novel gradient codes using tools from classical coding theory, namely, cyclic MDS codes, which compare favorably with existing…
In this paper, we consider a decentralized learning problem in the presence of stragglers. Although gradient coding techniques have been developed for distributed learning to evade stragglers, where the devices send encoded gradients with…
Gradient descent (GD) methods are commonly employed in machine learning problems to optimize the parameters of the model in an iterative fashion. For problems with massive datasets, computations are distributed to many parallel computing…
When gradient descent (GD) is scaled to many parallel workers for large scale machine learning problems, its per-iteration computation time is limited by the straggling workers. Straggling workers can be tolerated by assigning redundant…
In distributed machine learning (DML), the training data is distributed across multiple worker nodes to perform the underlying training in parallel. One major problem affecting the performance of DML algorithms is presence of stragglers.…
Edge computing has recently emerged as a promising paradigm to boost the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, the introduction of edge nodes adds an additional…