Related papers: Dynamic backup workers for parallel machine learni…

Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers

With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each…

Machine Learning · Computer Science 2021-02-15 Guojun Xiong , Gang Yan , Rahul Singh , Jian Li

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao , Aijun An , Junfeng Liu , Bao Xin Chen

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

Dynamic Parameter Allocation in Parameter Servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management---a key concern in…

Machine Learning · Computer Science 2020-07-06 Alexander Renz-Wieland , Rainer Gemulla , Steffen Zeuch , Volker Markl

PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks

Parameter updating is an important stage in parallelism-based distributed deep learning. Synchronous methods are widely used in distributed training the Deep Neural Networks (DNNs). To reduce the communication and synchronization overhead…

Machine Learning · Computer Science 2020-09-09 Qing Ye , Yuxuan Han , Yanan sun , JIancheng Lv

Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning

The bulk synchronous parallel (BSP) is a celebrated synchronization model for general-purpose parallel computing that has successfully been employed for distributed training of machine learning models. A prevalent shortcoming of the BSP is…

Machine Learning · Computer Science 2020-01-07 Xing Zhao , Manos Papagelis , Aijun An , Bao Xin Chen , Junfeng Liu , Yonggang Hu

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-21 Xinghao Pan , Jianmin Chen , Rajat Monga , Samy Bengio , Rafal Jozefowicz

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Machine Learning · Computer Science 2017-03-22 Jianmin Chen , Xinghao Pan , Rajat Monga , Samy Bengio , Rafal Jozefowicz

Semi-Dynamic Load Balancing: Efficient Distributed Learning in Non-Dedicated Environments

Machine learning (ML) models are increasingly trained in clusters with non-dedicated workers possessing heterogeneous resources. In such scenarios, model training efficiency can be negatively affected by stragglers -- workers that run much…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-09 Chen Chen , Qizhen Weng , Wei Wang , Baochun Li , Bo Li

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Synchronous strategies with data parallelism, such as the Synchronous StochasticGradient Descent (S-SGD) and the model averaging methods, are widely utilizedin distributed training of Deep Neural Networks (DNNs), largely owing to itseasy…

Machine Learning · Computer Science 2022-11-04 Qing Ye , Yuhao Zhou , Mingjia Shi , Yanan Sun , Jiancheng Lv

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Liang Luo , Jacob Nelson , Luis Ceze , Amar Phanishayee , Arvind Krishnamurthy

Learning While Scheduling in Multi-Server Systems with Unknown Statistics: MaxWeight with Discounted UCB

Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, crowdsourcing, and healthcare systems. This paper considers a multi-server system with multiple servers and multiple types of…

Machine Learning · Computer Science 2023-06-05 Zixian Yang , R. Srikant , Lei Ying

Population Based Training of Neural Networks

Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm. In…

Machine Learning · Computer Science 2017-11-29 Max Jaderberg , Valentin Dalibard , Simon Osindero , Wojciech M. Czarnecki , Jeff Donahue , Ali Razavi , Oriol Vinyals , Tim Green , Iain Dunning , Karen Simonyan , Chrisantha Fernando , Koray Kavukcuoglu

Elastic Model Aggregation with Parameter Service

Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm of performing model aggregation, causes CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-08 Juncheng Gu , Mosharaf Chowdhury , Kang G. Shin , Aditya Akella

Adaptive Tuning Algorithm for Performance tuning of Database Management System

Performance tuning of Database Management Systems(DBMS) is both complex and challenging as it involves identifying and altering several key performance tuning parameters. The quality of tuning and the extent of performance enhancement…

Databases · Computer Science 2010-05-07 S. F. Rodd , U. P. Kulkarni

NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access

Parameter servers (PSs) facilitate the implementation of distributed training for large machine learning tasks. In this paper, we argue that existing PSs are inefficient for tasks that exhibit non-uniform parameter access; their performance…

Databases · Computer Science 2022-03-29 Alexander Renz-Wieland , Rainer Gemulla , Zoi Kaoudi , Volker Markl

Probabilistic Synchronous Parallel

Most machine learning and deep neural network algorithms rely on certain iterative algorithms to optimise their utility/cost functions, e.g. Stochastic Gradient Descent. In distributed learning, the networked nodes have to work…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-06 Liang Wang , Ben Catterall , Richard Mortier

Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker…

Machine Learning · Computer Science 2024-07-10 Guojun Xiong , Gang Yan , Shiqiang Wang , Jian Li

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

Large Deviations Optimal Scheduling of Closed Queueing Networks

We study the design of dynamic scheduling controls in closed queueing networks with a fixed number of jobs. Each time a server becomes available, the controller has (limited) flexibility in choosing the buffer from which to serve a job. If…

Probability · Mathematics 2022-10-18 Siddhartha Banerjee , Yash Kanoria , Pengyu Qian