Related papers: Elastic Model Aggregation with Parameter Service

High-Performance Distributed ML at Scale through Parameter Server Consistency Models

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires…

Machine Learning · Computer Science 2014-10-31 Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, Eric P. Xing

Dynamic Parameter Allocation in Parameter Servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in…

Machine Learning · Computer Science 2020-07-06 Alexander Renz-Wieland, Rainer Gemulla, Steffen Zeuch, Volker Markl

NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access

Parameter servers (PSs) facilitate the implementation of distributed training for large machine learning tasks. In this paper, we argue that existing PSs are inefficient for tasks that exhibit non-uniform parameter access; their performance…

Databases · Computer Science 2022-03-29 Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-19 Hanpeng Hu, Dan Wang, Chuan Wu

Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-06 Md Sultanul Islam Ovi

Empirical Study of Straggler Problem in Parameter Server on Iterative Convergent Distributed Machine Learning

The purpose of this study is to test the effectiveness of current straggler mitigation techniques over different important iterative convergent machine learning (ML) algorithms, including Matrix Factorization (MF), Multinomial Logistic…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Benjamin Wong

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, a proper consistency model must be carefully chosen to ensure the algorithm's correctness and provide high…

Machine Learning · Statistics 2014-01-03 Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, Eric P. Xing

Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance

Distributed Machine Learning refers to the practice of training a model on multiple computers or devices that can be called nodes. Additionally, serverless computing is a new paradigm for cloud computing that uses functions as a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-28 Amine Barrak, Fabio Petrillo, Fehmi Jaafar

Towards Self-Tuning Parameter Servers

In recent years, advances in many applications have been driven by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data,…

Databases · Computer Science 2020-08-05 Chris Liu, Pengfei Zhang, Bo Tang, Hang Shen, Lei Zhu, Ziliang Lai, Eric Lo

A Comparative Measurement Study of Deep Learning as a Service Framework

Big data powered Deep Learning (DL) and its applications have blossomed in recent years, fueled by three technological trends: a large amount of digitized data openly accessible, a growing number of DL software frameworks in open source and…

Performance · Computer Science 2019-08-20 Yanzhao Wu, Ling Liu, Calton Pu, Wenqi Cao, Semih Sahin, Wenqi Wei, Qi Zhang

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao, Aijun An, Junfeng Liu, Bao Xin Chen

Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization

Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-05 Morgan Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers

Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-15 Yujing Ma, Florin Rusu, Kesheng Wu, Alexander Sim

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

Distributed Machine Learning (DML) systems are utilized to enhance the speed of model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it faces severe long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-15 Zixuan Chen, Lei Shi, Xuandong Liu, Xin Ai, Sen Liu, Yang Xu

Effective Elastic Scaling of Deep Learning Workloads

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-25 Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal, Ashish Verma

MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning

Existing Deep Learning frameworks exclusively use either the Parameter Server (PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-12 Amith R Mamidala, Georgios Kollias, Chris Ward, Fausto Artico

Activated Parameter Locating via Causal Intervention for Model Merging

Model merging combines multiple homologous models into one model, achieving convincing generalization without the necessity of additional training. A key challenge in this problem is resolving parameter redundancies and conflicts across…

Computation and Language · Computer Science 2024-08-20 Fanshuang Kong, Richong Zhang, Ziqiao Wang

Elastic deep learning in multi-tenant GPU cluster

We study how to support elasticity, i.e., the ability to dynamically adjust the parallelism (number of GPUs), for deep neural network (DNN) training. Elasticity can benefit multi-tenant GPU cluster management in many ways, e.g., achieving…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-03 Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, Fan Yu

Achieving Efficient Distributed Machine Learning Using a Novel Non-Linear Class of Aggregation Functions

Distributed machine learning (DML) over time-varying networks can be an enabler for emerging decentralized ML applications such as autonomous driving and drone fleeting. However, the commonly used weighted arithmetic mean model aggregation…

Machine Learning · Computer Science 2022-02-22 Haizhou Du, Ryan Yang, Yijian Chen, Qiao Xiang, Andre Wibisono, Wei Huang