Related papers: Straggler-Resilient Distributed Machine Learning w…

Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load

In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-04-19 · Maximilian Egger, Serge Kas Hanna, Rawad Bitar

Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker…

Machine Learning · Computer Science · 2024-07-10 · Guojun Xiong, Gang Yan, Shiqiang Wang, Jian Li

Dynamic backup workers for parallel machine learning

The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of $n$ workers, which iteratively compute updates of the model parameters, and a stateful PS,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-26 · Chuan Xu, Giovanni Neglia, Nicola Sebastianelli

Straggler-Robust Distributed Optimization in Parameter-Server Networks

Optimization in distributed networks plays a central role in almost all distributed machine learning problems. In principle, the use of distributed task allocation has reduced the computational time, allowing better response rates and…

Optimization and Control · Mathematics · 2021-08-23 · Elie Atallah, Nazanin Rahnavard, Chinwendu Enyioha

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-03-21 · Xinghao Pan, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Machine Learning · Computer Science · 2017-03-22 · Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz

Straggler-Robust Distributed Optimization with the Parameter Server Utilizing Coded Gradient

Optimization in distributed networks plays a central role in almost all distributed machine learning problems. In principle, the use of distributed task allocation has reduced the computational time, allowing better response rates and…

Optimization and Control · Mathematics · 2020-07-28 · Elie Atallah, Nazanin Rahnavard, Chinwendu Enyioha

Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers

We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or…

Machine Learning · Computer Science · 2023-10-18 · Serge Kas Hanna, Rawad Bitar, Parimal Parag, Venkat Dasari, Salim El Rouayheb

Efficient Replication for Straggler Mitigation in Distributed Computing

Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-12-29 · Amir Behrouzi-Far, Emina Soljanin

Straggler-Resilient Federated Learning: Leveraging the Interplay Between Statistical Accuracy and System Heterogeneity

Federated Learning is a novel paradigm that involves learning from data samples distributed across a large network of clients while the data remains local. It is, however, known that federated learning is prone to multiple system challenges…

Machine Learning · Computer Science · 2021-01-01 · Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, Ramtin Pedarsani

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science · 2017-02-08 · Yin Sun, C. Emre Koksal, Ness B. Shroff

Straggler-Resilient Personalized Federated Learning

Federated Learning is an emerging learning paradigm that allows training models from samples distributed across a large network of clients while respecting privacy and communication restrictions. Despite its success, federated learning…

Machine Learning · Computer Science · 2022-06-07 · Isidoros Tziotis, Zebang Shen, Ramtin Pedarsani, Hamed Hassani, Aryan Mokhtari

Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization

Distributed optimization is vital in solving large-scale machine learning problems. A widely-shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch…

Machine Learning · Computer Science · 2020-06-11 · Nuwan Ferdinand, Haider Al-Lawati, Stark C. Draper, Matthew Nokleby

Data Replication for Reducing Computing Time in Distributed Systems with Stragglers

In distributed computing systems with stragglers, various forms of redundancy can improve the average delay performance. We study the optimal replication of data in systems where the job execution time is a stochastically decreasing and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-01-01 · Amir Behrouzi-Far, Emina Soljanin

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can…

Machine Learning · Statistics · 2018-05-11 · Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is…

Machine Learning · Statistics · 2018-03-15 · Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin

Slow and Stale Gradients Can Win the Race

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness…

Machine Learning · Statistics · 2020-03-25 · Sanghamitra Dutta, Jianyu Wang, Gauri Joshi

Distributed Asynchronous Dual Free Stochastic Dual Coordinate Ascent

The primal-dual distributed optimization methods have broad large-scale machine learning applications. Previous primal-dual distributed methods are not applicable when the dual formulation is not available, e.g. the sum-of-non-convex…

Machine Learning · Computer Science · 2017-10-30 · Zhouyuan Huo, Heng Huang

DSAG: A mixed synchronous-asynchronous iterative method for straggler-resilient learning

We consider straggler-resilient learning. In many previous works, e.g., in the coded computing literature, straggling is modeled as random delays that are independent and identically distributed between workers. However, in many practical…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-30 · Albin Severinson, Eirik Rosnes, Salim El Rouayheb, Alexandre Graell i Amat

99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it

Many popular distributed optimization methods for training machine learning models fit the following template: a local gradient estimate is computed independently by each worker, then communicated to a master, which subsequently performs…

Machine Learning · Computer Science · 2019-06-05 · Konstantin Mishchenko, Filip Hanzely, Peter Richtárik