Related papers: Parameter Box: High Performance Parameter Servers …

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-22 Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

HyperTune: Dynamic Hyperparameter Tuning For Efficient Distribution of DNN Training Over Heterogeneous Systems

Distributed training is a novel approach to accelerate Deep Neural Networks (DNN) training, but common training libraries fall short of addressing the distributed cases with heterogeneous processors or the cases where the processing nodes…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-17 Ali HeydariGorji, Siavash Rezaei, Mahdi Torabzadehkashi, Hossein Bobarshad, Vladimir Alves, Pai H. Chou

Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters

Efficiently scaling deep neural networks across GPU clusters requires navigating complex trade-offs between computational throughput, memory utilization, and synchronization overhead. This paper presents a unified empirical evaluation of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-06 Md Sultanul Islam Ovi

FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

With the increased penetration and proliferation of Internet of Things (IoT) devices, there is a growing trend towards distributing the power of deep learning (DL) across edge devices rather than centralizing it in the cloud. This…

Machine Learning · Computer Science 2021-10-07 Yuhao Chen, Qianqian Yang, Shibo He, Zhiguo Shi, Jiming Chen

Priority-based Parameter Propagation for Distributed DNN Training

Data parallel training is widely used for scaling distributed deep neural network (DNN) training. However, the performance benefits are often limited by the communication-heavy parameter synchronization step. In this paper, we take…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-13 Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, Gennady Pekhimenko

Homomorphic Parameter Compression for Distributed Deep Learning Training

Distributed training of deep neural networks has received significant research interest, and its major approaches include implementations on multiple GPUs and clusters. Parallelization can dramatically improve the efficiency of training…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-29 Jaehee Jang, Byungook Na, Sungroh Yoon

dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-19 Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, Chuanxiong Guo

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Ji Liu, Zhihua Wu, Dianhai Yu, Yanjun Ma, Danlei Feng, Minxu Zhang, Xinxuan Wu, Xuefeng Yao, Dejing Dou

Distributed Training of Deep Neural Networks with Theoretical Analysis: Under SSP Setting

We propose a distributed approach to train deep neural networks (DNNs), which has guaranteed convergence theoretically and great scalability empirically: close to 6 times faster on instance of ImageNet data set when run with 6 machines. The…

Machine Learning · Statistics 2016-10-04 Abhimanu Kumar, Pengtao Xie, Junming Yin, Eric P. Xing

Distributed Deep Neural Networks over the Cloud, the Edge and End Devices

We propose distributed deep neural networks (DDNNs) over distributed computing hierarchies, consisting of the cloud, the edge (fog) and end devices. While being able to accommodate inference of a deep neural network (DNN) in the cloud, a…

Computer Vision and Pattern Recognition · Computer Science 2017-09-08 Surat Teerapittayanon, Bradley McDanel, H. T. Kung

Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices

In recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs,…

Machine Learning · Computer Science 2017-05-24 Tayfun Gokmen, Yurii Vlasov

Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement

The Convolutional Neural Network (CNN) model, often used for image classification, requires significant training time to obtain high accuracy. To this end, distributed training is performed with the parameter server (PS) architecture using…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-18 Jay H. Park, Sunghwan Kim, Jinwon Lee, Myeongjae Jeon, Sam H. Noh

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

Distributed Training and Optimization Of Neural Networks

Deep learning models are yielding increasingly better performances thanks to multiple factors. To be successful, model may have large number of parameters or complex architectures and be trained on large dataset. This leads to large…

Machine Learning · Computer Science 2022-12-20 Jean-Roch Vlimant, Junqi Yin

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-17 Davit Buniatyan

DynaComm: Accelerating Distributed CNN Training between Edges and Clouds through Dynamic Communication Scheduling

To reduce uploading bandwidth and address privacy concerns, deep learning at the network edge has been an emerging topic. Typically, edge devices collaboratively train a shared model using real-time generated data through the Parameter…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-11 Shangming Cai, Dongsheng Wang, Haixia Wang, Yongqiang Lyu, Guangquan Xu, Xi Zheng, Athanasios V. Vasilakos

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava, Santanu Chaudhury, Dr. Jayadeva

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning

Large-scale deep neural networks (DNN) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Yisu Wang, Ruilong Wu, Xinjiao Li, Dirk Kutscher

PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks

Parameter updating is an important stage in parallelism-based distributed deep learning. Synchronous methods are widely used in distributed training the Deep Neural Networks (DNNs). To reduce the communication and synchronization overhead…

Machine Learning · Computer Science 2020-09-09 Qing Ye, Yuxuan Han, Yanan Sun, Jiancheng Lv