Aditya Ramamoorthy
Large-scale distributed learning aims at minimizing a loss function $L$ that depends on a training dataset with respect to a $d$-length parameter vector. The distributed cluster typically consists of a parameter server (PS) and multiple…
We consider the problem of zero-error function computation with side information. Alice and Bob have correlated sources $X,Y$ with joint p.m.f. $p_{XY}(\cdot, \cdot)$. Bob wants to calculate $f(X,Y)$ with zero error. Alice encodes…
Distributed computing systems are well-known to suffer from the problem of slow or failed nodes; these are referred to as stragglers. Straggler mitigation (for distributed matrix computations) has recently been investigated from the…
Within distributed learning, workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of $\nabla L$ (gradient of…
Matrix computations are a fundamental building-block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations…
We consider a quantum and classical version multi-party function computation problem with $n$ players, where players $2, \dots, n$ need to communicate appropriate information to player 1, so that a "generalized" inner product function with…
Straggler nodes are well-known bottlenecks of distributed matrix computations which induce reductions in computation/communication speeds. A common strategy for mitigating such stragglers is to incorporate Reed-Solomon based MDS (maximum…
Existing approaches to distributed matrix computations involve allocating coded combinations of submatrices to worker nodes, to build resilience to stragglers and/or enhance privacy. In this study, we consider the challenge of preserving…
A plethora of modern machine learning tasks require the utilization of large-scale distributed clusters as a critical component of the training pipeline. However, abnormal Byzantine behavior of the worker nodes can derail the training and…
Polynomial based approaches, such as the Mat-Dot and entangled polynomial codes (EPC) have been used extensively within coded matrix computations to obtain schemes with good recovery thresholds. However, these schemes are well-recognized to…
Federated learning (FL) is a popular technique for training a global model on data distributed across client devices. Like other distributed training techniques, FL is susceptible to straggler (slower or failed) clients. Recent work has…
In this work we study the problem of Subspace Tracking with missing data (ST-miss) and outliers (Robust ST-miss). We propose a novel algorithm, and provide a guarantee for both these problems. Unlike past work on this topic, the current…
The overall execution time of distributed matrix computations is often dominated by slow worker nodes (stragglers) within the clusters. Recently, different coding techniques have been utilized to mitigate the effect of stragglers where…
State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Crucially, such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary…
Many big data algorithms executed on MapReduce-like systems have a shuffle phase that often dominates the overall job execution time. Recent work has demonstrated schemes where the communication load in the shuffle phase can be traded off…
Distributed matrix computations over large clusters can suffer from the problem of slow or failed worker nodes (called stragglers) which can dominate the overall job execution time. Coded computation utilizes concepts from erasure coding to…
Polynomial based methods have recently been used in several works for mitigating the effect of stragglers (slow or failed nodes) in distributed matrix computations. For a system with $n$ worker nodes where $s$ can be stragglers, these…
Training of large scale models on distributed clusters is a critical component of the machine learning pipeline. However, this training can easily be made to fail if some workers behave in an adversarial (Byzantine) fashion whereby they…
Coded caching is a technique that promises huge reductions in network traffic in content-delivery networks. However, the original formulation and several subsequent contributions in the area, assume that the file requests from the users are…
Distributed matrix computations -- matrix-matrix or matrix-vector multiplications -- are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of prior work in this area is (i) either sub-optimal in…