Related papers: Dynamic Parameter Allocation in Parameter Servers

NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access

Parameter servers (PSs) facilitate the implementation of distributed training for large machine learning tasks. In this paper, we argue that existing PSs are inefficient for tasks that exhibit non-uniform parameter access; their performance…

Databases · Computer Science 2022-03-29 Alexander Renz-Wieland , Rainer Gemulla , Zoi Kaoudi , Volker Markl

Elastic Model Aggregation with Parameter Service

Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm of performing model aggregation, causes CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-08 Juncheng Gu , Mosharaf Chowdhury , Kang G. Shin , Aditya Akella

Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing

Sharing parameters in multi-agent deep reinforcement learning has played an essential role in allowing algorithms to scale to a large number of agents. Parameter sharing between agents significantly decreases the number of trainable…

Multiagent Systems · Computer Science 2021-06-15 Filippos Christianos , Georgios Papoudakis , Arrasy Rahman , Stefano V. Albrecht

Architecting Peer-to-Peer Serverless Distributed Machine Learning Training for Improved Fault Tolerance

Distributed Machine Learning refers to the practice of training a model on multiple computers or devices that can be called nodes. Additionally, serverless computing is a new paradigm for cloud computing that uses functions as a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-28 Amine Barrak , Fabio Petrillo , Fehmi Jaafar

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao , Aijun An , Junfeng Liu , Bao Xin Chen

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

High-Performance Distributed ML at Scale through Parameter Server Consistency Models

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires…

Machine Learning · Computer Science 2014-10-31 Wei Dai , Abhimanu Kumar , Jinliang Wei , Qirong Ho , Garth Gibson , Eric P. Xing

Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels

In this paper, we consider partitioned edge learning (PARTEL), which implements parameter-server training, a well known distributed learning method, in a wireless network. Thereby, PARTEL leverages distributed computation resources at edge…

Information Theory · Computer Science 2021-03-19 Dingzhu Wen , Ki-Jun Jeon , Mehdi Bennis , Kaibin Huang

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Liang Luo , Jacob Nelson , Luis Ceze , Amar Phanishayee , Arvind Krishnamurthy

Distributed Optimization for Client-Server Architecture with Negative Gradient Weights

Availability of both massive datasets and computing resources have made machine learning and predictive analytics extremely pervasive. In this work we present a synchronous algorithm and architecture for distributed optimization motivated…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-20 Shripad Gade , Nitin H. Vaidya

Dividable Configuration Performance Learning

Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration…

Software Engineering · Computer Science 2024-11-21 Jingzhi Gong , Tao Chen , Rami Bahsoon

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, proper consistency model must be carefully chosen to ensure algorithm's correctness and provide high…

Machine Learning · Statistics 2014-01-03 Jinliang Wei , Wei Dai , Abhimanu Kumar , Xun Zheng , Qirong Ho , Eric P. Xing

Dynamic backup workers for parallel machine learning

The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of $n$ workers, which iteratively compute updates of the model parameters, and a stateful PS,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-26 Chuan Xu , Giovanni Neglia , Nicola Sebastianelli

Stanza: Layer Separation for Distributed Training in Deep Learning

The parameter server architecture is prevalently used for distributed deep learning. Each worker machine in a parameter server system trains the complete model, which leads to a hefty amount of network data transfer between workers and…

Machine Learning · Computer Science 2019-01-11 Xiaorui Wu , Hong Xu , Bo Li , Yongqiang Xiong

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

Distributed Machine Learning (DML) systems are utilized to enhance the speed of model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it faces severe long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-15 Zixuan Chen , Lei Shi , Xuandong Liu , Xin Ai , Sen Liu , Yang Xu

Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning

Distributed computing excels at processing large scale data, but the communication cost for synchronizing the shared parameters may slow down the overall performance. Fortunately, the interactions between parameter and data in many problems…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-19 Mu Li , Dave G. Andersen , Alexander J. Smola

PAL -- Parallel active learning for machine-learned potentials

Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while…

Machine Learning · Computer Science 2024-12-03 Chen Zhou , Marlen Neubert , Yuri Koide , Yumeng Zhang , Van-Quan Vuong , Tobias Schlöder , Stefanie Dehnen , Pascal Friederich

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou , Chun-Yen Chen , Jui-Lin Wu , Chun-Nan Chou , Chia-Chin Tsao , Kuan-Chieh Tung , Ting-Wei Lin , Cheng-Lung Sung , Edward Y. Chang

Distributed linear regression by averaging

Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck.…

Statistics Theory · Mathematics 2022-10-25 Edgar Dobriban , Yue Sheng

Priority-based Parameter Propagation for Distributed DNN Training

Data parallel training is widely used for scaling distributed deep neural network (DNN) training. However, the performance benefits are often limited by the communication-heavy parameter synchronization step. In this paper, we take…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-13 Anand Jayarajan , Jinliang Wei , Garth Gibson , Alexandra Fedorova , Gennady Pekhimenko