Related papers: Parameter Database : Data-centric Synchronization …

High-Performance Distributed ML at Scale through Parameter Server Consistency Models

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires…

Machine Learning · Computer Science 2014-10-31 Wei Dai , Abhimanu Kumar , Jinliang Wei , Qirong Ho , Garth Gibson , Eric P. Xing

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, proper consistency model must be carefully chosen to ensure algorithm's correctness and provide high…

Machine Learning · Statistics 2014-01-03 Jinliang Wei , Wei Dai , Abhimanu Kumar , Xun Zheng , Qirong Ho , Eric P. Xing

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-19 Hanpeng Hu , Dan Wang , Chuan Wu

Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning

The bulk synchronous parallel (BSP) is a celebrated synchronization model for general-purpose parallel computing that has successfully been employed for distributed training of machine learning models. A prevalent shortcoming of the BSP is…

Machine Learning · Computer Science 2020-01-07 Xing Zhao , Manos Papagelis , Aijun An , Bao Xin Chen , Junfeng Liu , Yonggang Hu

Petuum: A New Platform for Distributed Machine Learning on Big Data

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…

Machine Learning · Statistics 2015-05-18 Eric P. Xing , Qirong Ho , Wei Dai , Jin Kyu Kim , Jinliang Wei , Seunghak Lee , Xun Zheng , Pengtao Xie , Abhimanu Kumar , Yaoliang Yu

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian , Zulfiya R. Gabidullina , Hamid R. Rabiee

An Asynchronous Distributed Framework for Large-scale Learning Based on Parameter Exchanges

In many distributed learning problems, the heterogeneous loading of computing machines may harm the overall performance of synchronous strategies. In this paper, we propose an effective asynchronous distributed framework for the…

Machine Learning · Statistics 2017-05-23 Bikash Joshi , Franck Iutzeler , Massih-Reza Amini

Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning

Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing. For fast-increasing applications and data amounts, distributed learning is a…

Machine Learning · Computer Science 2022-02-08 Hao Chen , Yu Ye , Ming Xiao , Mikael Skoglund

Asynchronous Multi-Task Learning

Many real-world machine learning applications involve several learning tasks which are inter-related. For example, in healthcare domain, we need to learn a predictive model of a certain disease for many hospitals. The models for each…

Machine Learning · Computer Science 2016-10-03 Inci M. Baytas , Ming Yan , Anil K. Jain , Jiayu Zhou

Machine Learning for Scheduling: A Paradigm Shift from Solver-Centric to Data-Centric Approaches

Scheduling problems are a fundamental class of combinatorial optimization problems that underpin operational efficiency in manufacturing, logistics, and service systems. While operations research has traditionally developed solver-centric…

Optimization and Control · Mathematics 2026-02-03 Anbang Liu , Shaochong Lin , Jingchuan Chen , Peng Wu , Zuojun Max Shen

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao , Aijun An , Junfeng Liu , Bao Xin Chen

Scaling Distributed Machine Learning with In-Network Aggregation

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio , Marco Canini , Chen-Yu Ho , Jacob Nelson , Panos Kalnis , Changhoon Kim , Arvind Krishnamurthy , Masoud Moshref , Dan R. K. Ports , Peter Richtárik

Decentralized Learning Made Easy with DecentralizePy

Decentralized learning (DL) has gained prominence for its potential benefits in terms of scalability, privacy, and fault tolerance. It consists of many nodes that coordinate without a central server and exchange millions of parameters in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Akash Dhasade , Anne-Marie Kermarrec , Rafael Pires , Rishi Sharma , Milos Vujasinovic

Distributed Multi-Task Relationship Learning

Multi-task learning aims to learn multiple tasks jointly by exploiting their relatedness to improve the generalization performance for each task. Traditionally, to perform multi-task learning, one needs to centralize data from all the tasks…

Machine Learning · Computer Science 2017-06-21 Sulin Liu , Sinno Jialin Pan , Qirong Ho

ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning

ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-24 Saeed Soori , Bugra Can , Mert Gurbuzbalaba , Maryam Mehri Dehnavi

Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models

Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new…

Machine Learning · Computer Science 2025-12-01 Junyi Zhu , Ruicong Yao , Taha Ceritli , Savas Ozkan , Matthew B. Blaschko , Eunchung Noh , Jeongwon Min , Cho Jung Min , Mete Ozay

Snap ML: A Hierarchical Framework for Machine Learning

We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the…

Machine Learning · Computer Science 2018-11-30 Celestine Dünner , Thomas Parnell , Dimitrios Sarigiannis , Nikolas Ioannou , Andreea Anghel , Gummadi Ravi , Madhusudanan Kandasamy , Haralampos Pozidis

A Partition-insensitive Parallel Framework for Distributed Model Fitting

Distributed model fitting refers to the process of fitting a mathematical or statistical model to the data using distributed computing resources, such that computing tasks are divided among multiple interconnected computers or nodes, often…

Computation · Statistics 2024-06-04 Xiaofei Wu , Rongmei Liang , Fabio Roli , Marcello Pelillo , Jing Yuan

A Unified Transferable Model for ML-Enhanced DBMS

Recently, the database management system (DBMS) community has witnessed the power of machine learning (ML) solutions for DBMS tasks. Despite their promising performance, these existing solutions can hardly be considered satisfactory. First,…

Databases · Computer Science 2021-11-29 Ziniu Wu , Pei Yu , Peilun Yang , Rong Zhu , Yuxing Han , Yaliang Li , Defu Lian , Kai Zeng , Jingren Zhou

Parallelizing Machine Learning as a Service for the End-User

As ML applications are becoming ever more pervasive, fully-trained systems are made increasingly available to a wide public, allowing end-users to submit queries with their own data, and to efficiently retrieve results. With increasingly…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-01 Daniela Loreti , Marco Lippi , Paolo Torroni