Related papers: High-Performance Distributed ML at Scale through P…

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, a proper consistency model must be carefully chosen to ensure the algorithm's correctness and provide high…

Machine Learning · Statistics 2014-01-03 Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, Eric P. Xing

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

Distributed Machine Learning (DML) systems are utilized to enhance the speed of model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it faces severe long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-15 Zixuan Chen, Lei Shi, Xuandong Liu, Xin Ai, Sen Liu, Yang Xu

Towards Self-Tuning Parameter Servers

In recent years, advances in many applications have been driven by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data,…

Databases · Computer Science 2020-08-05 Chris Liu, Pengfei Zhang, Bo Tang, Hang Shen, Lei Zhu, Ziliang Lai, Eric Lo

Parameter Database: Data-centric Synchronization for Scalable Machine Learning

We propose a new data-centric synchronization framework for carrying out machine learning (ML) tasks in a distributed environment. Our framework exploits the iterative nature of ML algorithms and relaxes the application-agnostic bulk…

Databases · Computer Science 2015-08-06 Naman Goel, Divyakant Agrawal, Sanjay Chawla, Ahmed Elmagarmid

Elastic Model Aggregation with Parameter Service

Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm for performing model aggregation, causes CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-08 Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Aditya Akella

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-19 Hanpeng Hu, Dan Wang, Chuan Wu

Strategies and Principles of Distributed Machine Learning on Big Data

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics…

Machine Learning · Statistics 2016-01-01 Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI), the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning

Existing Deep Learning frameworks exclusively use either the Parameter Server (PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-12 Amith R Mamidala, Georgios Kollias, Chris Ward, Fausto Artico

Dynamic Parameter Allocation in Parameter Servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in…

Machine Learning · Computer Science 2020-07-06 Alexander Renz-Wieland, Rainer Gemulla, Steffen Zeuch, Volker Markl

NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access

Parameter servers (PSs) facilitate the implementation of distributed training for large machine learning tasks. In this paper, we argue that existing PSs are inefficient for tasks that exhibit non-uniform parameter access; their performance…

Databases · Computer Science 2022-03-29 Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl

High Performance Latent Variable Models

Latent variable models have attracted considerable interest from industry and academia for their versatility in a wide range of applications. A large amount of effort has been made to develop systems that are able to extend…

Machine Learning · Computer Science 2015-11-19 Aaron Q. Li, Amr Ahmed, Mu Li, Vanja Josifovski

Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-06 Md Sultanul Islam Ovi

Outlook Towards Deployable Continual Learning for Particle Accelerators

Particle accelerators are high-power, complex machines. To ensure uninterrupted operation of these machines, thousands of pieces of equipment need to be synchronized, which requires addressing many challenges including design, optimization…

Machine Learning · Computer Science 2025-04-08 Kishansingh Rajput, Sen Lin, Auralee Edelen, Willem Blokland, Malachi Schram

Petuum: A New Platform for Distributed Machine Learning on Big Data

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…

Machine Learning · Statistics 2015-05-18 Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

Empirical Study of Straggler Problem in Parameter Server on Iterative Convergent Distributed Machine Learning

The purpose of this study is to test the effectiveness of current straggler mitigation techniques across several important iterative-convergent machine learning (ML) algorithms, including Matrix Factorization (MF), Multinomial Logistic…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Benjamin Wong

Dependable Distributed Training of Compressed Machine Learning Models

The existing work on the distributed training of machine learning (ML) models has consistently overlooked the distribution of the achieved learning quality, focusing instead on its average value. This leads to a poor dependability of the…

Machine Learning · Computer Science 2024-02-23 Francesco Malandrino, Giuseppe Di Giacomo, Marco Levorato, Carla Fabiana Chiasserini

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Real-Time Machine Learning: The Missing Pieces

Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-23 Robert Nishihara, Philipp Moritz, Stephanie Wang, Alexey Tumanov, William Paul, Johann Schleier-Smith, Richard Liaw, Mehrdad Niknami, Michael I. Jordan, Ion Stoica

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao, Aijun An, Junfeng Liu, Bao Xin Chen