Related papers: Dynamic backup workers for parallel machine learni…

200 papers

With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each…

Machine Learning · Computer Science 2021-02-15 Guojun Xiong , Gang Yan , Rahul Singh , Jian Li

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao , Aijun An , Junfeng Liu , Bao Xin Chen

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management---a key concern in…

Machine Learning · Computer Science 2020-07-06 Alexander Renz-Wieland , Rainer Gemulla , Steffen Zeuch , Volker Markl

Parameter updating is an important stage in parallelism-based distributed deep learning. Synchronous methods are widely used in distributed training of Deep Neural Networks (DNNs). To reduce the communication and synchronization overhead…

Machine Learning · Computer Science 2020-09-09 Qing Ye , Yuxuan Han , Yanan Sun , Jiancheng Lv

The bulk synchronous parallel (BSP) is a celebrated synchronization model for general-purpose parallel computing that has successfully been employed for distributed training of machine learning models. A prevalent shortcoming of the BSP is…

Machine Learning · Computer Science 2020-01-07 Xing Zhao , Manos Papagelis , Aijun An , Bao Xin Chen , Junfeng Liu , Yonggang Hu

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-21 Xinghao Pan , Jianmin Chen , Rajat Monga , Samy Bengio , Rafal Jozefowicz

Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In…

Machine Learning · Computer Science 2017-03-22 Jianmin Chen , Xinghao Pan , Rajat Monga , Samy Bengio , Rafal Jozefowicz

Machine learning (ML) models are increasingly trained in clusters with non-dedicated workers possessing heterogeneous resources. In such scenarios, model training efficiency can be negatively affected by stragglers -- workers that run much…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-09 Chen Chen , Qizhen Weng , Wei Wang , Baochun Li , Bo Li

Synchronous strategies with data parallelism, such as the Synchronous Stochastic Gradient Descent (S-SGD) and the model averaging methods, are widely utilized in distributed training of Deep Neural Networks (DNNs), largely owing to its easy…

Machine Learning · Computer Science 2022-11-04 Qing Ye , Yuhao Zhou , Mingjia Shi , Yanan Sun , Jiancheng Lv

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Liang Luo , Jacob Nelson , Luis Ceze , Amar Phanishayee , Arvind Krishnamurthy

Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, crowdsourcing, and healthcare systems. This paper considers a multi-server system with multiple servers and multiple types of…

Machine Learning · Computer Science 2023-06-05 Zixian Yang , R. Srikant , Lei Ying

Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm. In…

Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm of performing model aggregation, causes CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-08 Juncheng Gu , Mosharaf Chowdhury , Kang G. Shin , Aditya Akella

Performance tuning of Database Management Systems (DBMS) is both complex and challenging as it involves identifying and altering several key performance tuning parameters. The quality of tuning and the extent of performance enhancement…

Databases · Computer Science 2010-05-07 S. F. Rodd , U. P. Kulkarni

Parameter servers (PSs) facilitate the implementation of distributed training for large machine learning tasks. In this paper, we argue that existing PSs are inefficient for tasks that exhibit non-uniform parameter access; their performance…

Databases · Computer Science 2022-03-29 Alexander Renz-Wieland , Rainer Gemulla , Zoi Kaoudi , Volker Markl

Most machine learning and deep neural network algorithms rely on certain iterative algorithms to optimise their utility/cost functions, e.g. Stochastic Gradient Descent. In distributed learning, the networked nodes have to work…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-06 Liang Wang , Ben Catterall , Richard Mortier

With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker…

Machine Learning · Computer Science 2024-07-10 Guojun Xiong , Gang Yan , Shiqiang Wang , Jian Li

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

We study the design of dynamic scheduling controls in closed queueing networks with a fixed number of jobs. Each time a server becomes available, the controller has (limited) flexibility in choosing the buffer from which to serve a job. If…

Probability · Mathematics 2022-10-18 Siddhartha Banerjee , Yash Kanoria , Pengyu Qian