Related papers: CodedReduce: A Fast and Robust Framework for Gradi…

Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning

Distributed implementations are crucial in speeding up large scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A…

Information Theory · Computer Science 2021-03-02 Baturalp Buyukates, Emre Ozfatura, Sennur Ulukus, Deniz Gunduz

Gradient Coding with Clustering and Multi-message Communication

Gradient descent (GD) methods are commonly employed in machine learning problems to optimize the parameters of the model in an iterative fashion. For problems with massive datasets, computations are distributed to many parallel computing…

Information Theory · Computer Science 2019-03-06 Emre Ozfatura, Deniz Gunduz, Sennur Ulukus

Communication-Efficient Approximate Gradient Coding for Distributed Learning in Heterogeneous Systems

We propose a communication-efficient optimally structured gradient coding scheme to jointly address straggler resilience and communication efficiency in heterogeneous distributed learning. By establishing a unified framework that…

Systems and Control · Electrical Eng. & Systems 2026-05-18 Heekang Song, Wan Choi

Sequential Gradient Coding For Straggler Mitigation

In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation…

Machine Learning · Computer Science 2023-06-29 M. Nikhil Krishnan, MohammadReza Ebrahimi, Ashish Khisti

Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called 'stragglers', and…

Information Theory · Computer Science 2020-05-15 Swanand Kadhe, O. Ozan Koyluoglu, Kannan Ramchandran

CoDGraD: A Code-based Distributed Gradient Descent Scheme for Decentralized Convex Optimization

In this paper, we consider a large network containing many regions such that each region is equipped with a worker with some data processing and communication capability. For such a network, some workers may become stragglers due to the…

Systems and Control · Electrical Eng. & Systems 2022-04-14 Elie Atallah, Nazanin Rahnavard, Qiyu Sun

GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques…

Machine Learning · Computer Science 2019-01-01 Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, Salman Avestimehr

GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottleneck and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-03 Menglu Yu, Ye Tian, Bo Ji, Chuan Wu, Hridesh Rajan, Jia Liu

Gradient Coding with Dynamic Clustering for Straggler Mitigation

In distributed synchronous gradient descent (GD) the main performance bottleneck for the per-iteration completion time is the slowest straggling workers. To speed up GD iterations in the presence of stragglers, coded distributed…

Information Theory · Computer Science 2020-11-04 Baturalp Buyukates, Emre Ozfatura, Sennur Ulukus, Deniz Gunduz

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-30 Sahil Tyagi, Martin Swany

RedSync: Reducing Synchronization Traffic for Distributed Deep Learning

Data parallelism has become a dominant method to scale Deep Neural Network (DNN) training across multiple nodes. Since synchronizing a large number of gradients of the local model can be a bottleneck for large-scale distributed training,…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-23 Jiarui Fang, Haohuan Fu, Guangwen Yang, Cho-Jui Hsieh

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing needs for DL also led to communication- and resource-intensive distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-16 Menglu Yu, Bo Ji, Hridesh Rajan, Jia Liu

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes…

Information Theory · Computer Science 2017-10-30 Songze Li, Seyed Mohammadreza Mousavi Kalan, A. Salman Avestimehr, Mahdi Soltanolkotabi

Communication-Computation Efficient Gradient Coding

This paper develops coding techniques to reduce the running time of distributed learning tasks. It characterizes the fundamental tradeoff to compute gradients (and more generally vector summations) in terms of three parameters: computation…

Machine Learning · Statistics 2018-02-13 Min Ye, Emmanuel Abbe

Communication-Efficient Approximate Gradient Coding

Large-scale distributed learning aims at minimizing a loss function $L$ that depends on a training dataset with respect to a $d$-length parameter vector. The distributed cluster typically consists of a parameter server (PS) and multiple…

Information Theory · Computer Science 2026-03-25 Sifat Munim, Aditya Ramamoorthy

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is…

Machine Learning · Statistics 2018-03-15 Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin

Coded Computing for Distributed Graph Analytics

Performance of distributed graph processing systems significantly suffers from 'communication bottleneck' as a large number of messages are exchanged among servers at each step of the computation. Motivated by graph based MapReduce, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-11 Saurav Prakash, Amirhossein Reisizadeh, Ramtin Pedarsani, Amir Salman Avestimehr

Distributed Stochastic Gradient Descent Using LDGM Codes

We consider a distributed learning problem in which the computation is carried out on a system consisting of a master node and multiple worker nodes. In such systems, the existence of slow-running machines called stragglers will cause a…

Information Theory · Computer Science 2019-01-16 Shunsuke Horii, Takahiro Yoshida, Manabu Kobayashi, Toshiyasu Matsushima

Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning

We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding has shown how to add redundancy in distributed gradient descent to guarantee convergence…

Information Theory · Computer Science 2019-05-15 Rawad Bitar, Mary Wootters, Salim El Rouayheb

S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

The distributed stochastic gradient descent (SGD) approach has been widely used in large-scale deep learning, and the gradient collective method is vital to ensure the training scalability of the distributed deep learning system. Collective…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-29 Keshi Ge, Yongquan Fu, Zhiquan Lai, Xiaoge Deng, Dongsheng Li