Related papers: Consistent Bounded-Asynchronous Parameter Servers …

High-Performance Distributed ML at Scale through Parameter Server Consistency Models

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires…

Machine Learning · Computer Science 2014-10-31 Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, Eric P. Xing

Consistency models in distributed systems: A survey on definitions, disciplines, challenges and applications

The replication mechanism resolves some challenges with big data such as data durability, data access, and fault tolerance. Yet, replication itself gives birth to another challenge known as the consistency in distributed systems.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-12 Hesam Nejati Sharif Aldin, Hossein Deldari, Mohammad Hossein Moattar, Mostafa Razavi Ghods

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-19 Hanpeng Hu, Dan Wang, Chuan Wu

Parameter Database: Data-centric Synchronization for Scalable Machine Learning

We propose a new data-centric synchronization framework for carrying out machine learning (ML) tasks in a distributed environment. Our framework exploits the iterative nature of ML algorithms and relaxes the application agnostic bulk…

Databases · Computer Science 2015-08-06 Naman Goel, Divyakant Agrawal, Sanjay Chawla, Ahmed Elmagarmid

An Asynchronous Distributed Framework for Large-scale Learning Based on Parameter Exchanges

In many distributed learning problems, the heterogeneous loading of computing machines may harm the overall performance of synchronous strategies. In this paper, we propose an effective asynchronous distributed framework for the…

Machine Learning · Statistics 2017-05-23 Bikash Joshi, Franck Iutzeler, Massih-Reza Amini

Starting a Dialog between Model Checking and Fault-tolerant Distributed Algorithms

Fault-tolerant distributed algorithms are central for building reliable spatially distributed systems. Unfortunately, the lack of a canonical precise framework for fault-tolerant algorithms is an obstacle for both verification and…

Formal Languages and Automata Theory · Computer Science 2012-10-16 Annu John, Igor Konnov, Ulrich Schmid, Helmut Veith, Josef Widder

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI), the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

Convergence of Distributed Stochastic Variance Reduced Methods without Sampling Extra Data

Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to their appealing run time complexity. When the data size is large and disjointly stored on different machines, it becomes…

Machine Learning · Computer Science 2020-08-26 Shicong Cen, Huishuai Zhang, Yuejie Chi, Wei Chen, Tie-Yan Liu

A Partition-insensitive Parallel Framework for Distributed Model Fitting

Distributed model fitting refers to the process of fitting a mathematical or statistical model to the data using distributed computing resources, such that computing tasks are divided among multiple interconnected computers or nodes, often…

Computation · Statistics 2024-06-04 Xiaofei Wu, Rongmei Liang, Fabio Roli, Marcello Pelillo, Jing Yuan

Empirical Study of Straggler Problem in Parameter Server on Iterative Convergent Distributed Machine Learning

The purpose of this study is to test the effectiveness of current straggler mitigation techniques over different important iterative convergent machine learning (ML) algorithms, including Matrix Factorization (MF), Multinomial Logistic…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Benjamin Wong

Deterministic Consistency: A Programming Model for Shared Memory Parallelism

The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code,…

Operating Systems · Computer Science 2010-02-01 Amittai Aviram, Bryan Ford

Toward Understanding the Impact of Staleness in Distributed Machine Learning

Many distributed machine learning (ML) systems adopt non-synchronous execution in order to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in…

Machine Learning · Computer Science 2018-10-09 Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, Eric P. Xing

High Performance Latent Variable Models

Latent variable models have accumulated a considerable amount of interest from the industry and academia for their versatility in a wide range of applications. A large amount of effort has been made to develop systems that are able to extend…

Machine Learning · Computer Science 2015-11-19 Aaron Q. Li, Amr Ahmed, Mu Li, Vanja Josifovski

Randomized Constraints Consensus for Distributed Robust Linear Programming

In this paper we consider a network of processors aiming at cooperatively solving linear programming problems subject to uncertainty. Each node only knows a common cost function and its local uncertain constraint set. We propose a…

Optimization and Control · Mathematics 2019-08-27 Mohammadreza Chamanbaz, Giuseppe Notarstefano, Roland Bouffanais

Distributed Learning over Unreliable Networks

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-17 Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server

This paper presents an asynchronous incremental aggregated gradient algorithm and its implementation in a parameter server framework for solving regularized optimization problems. The algorithm can handle both general convex (possibly…

Optimization and Control · Mathematics 2016-10-19 Arda Aytekin, Hamid Reza Feyzmahdavian, Mikael Johansson

Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems

The semantics of HPC storage systems are defined by the consistency models to which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-03 Chen Wang, Kathryn Mohror, Marc Snir

Dynamic Parameter Allocation in Parameter Servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in…

Machine Learning · Computer Science 2020-07-06 Alexander Renz-Wieland, Rainer Gemulla, Steffen Zeuch, Volker Markl

Analysis of Distributed Algorithms for Big-data

Parallel and distributed processing are becoming the de facto industry standard, and a large part of current research targets how to make computing scalable and distributed, dynamically, without allocating the resources on…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-10 Rajendra Purohit, K R Chowdhary, S D Purohit

Diffusion LMS for clustered multitask networks

Recent research works on distributed adaptive networks have intensively studied the case where the nodes estimate a common parameter vector collaboratively. However, there are many applications that are multitask-oriented in the sense that…

Systems and Control · Computer Science 2013-11-04 Jie Chen, Cédric Richard, Ali Sayed