Related papers: TensorSocket: Shared Data Loading for Deep Learnin…

CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover Manner

Distributed deep learning workloads include throughput-intensive training tasks on GPU clusters, where distributed stochastic gradient descent (SGD) incurs significant communication delays after backward propagation and forces workers…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-16 · Cheng Luo, Lei Qu, Youshan Miao, Peng Cheng, Yongqiang Xiong

Accelerating Data Loading in Deep Neural Network Training

Data loading can dominate deep neural network training time on large-scale systems. We present a comprehensive study on accelerating data loading performance in large-scale distributed training. We first identify performance and scalability…

Machine Learning · Computer Science · 2020-02-20 · Chih-Chieh Yang, Guojing Cong

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-06-08 · Ji Liu, Zhihua Wu, Dianhai Yu, Yanjun Ma, Danlei Feng, Minxu Zhang, Xinxuan Wu, Xuefeng Yao, Dejing Dou

Tesseract: Parallelize the Tensor Parallelism Efficiently

Together with the improvements in state-of-the-art accuracies of various tasks, deep learning models are getting significantly larger. However, it is extremely difficult to implement these large models because limited GPU memory makes it…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-09-02 · Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating…

Performance · Computer Science · 2019-05-07 · Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training

Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of…

Hardware Architecture · Computer Science · 2020-10-27 · Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science · 2019-01-10 · Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphical processing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-15 · Jose González-Abad, Álvaro López García, Valentin Y. Kozlov

Horovod: fast and easy distributed deep learning in TensorFlow

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the…

Machine Learning · Computer Science · 2018-02-22 · Alexander Sergeev, Mike Del Balso

Distributed Training of Deep Learning Models: A Taxonomic Perspective

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads in their chosen…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-09 · Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue

CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers

Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-01-09 · Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, Peter Pietzuch

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor…

Hardware Architecture · Computer Science · 2024-04-24 · Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, Divya Mahajan

Towards Federated Learning Under Resource Constraints via Layer-wise Training and Depth Dropout

Large machine learning models trained on diverse data have recently seen unprecedented success. Federated learning enables training on private data that may otherwise be inaccessible, such as domain-specific datasets decentralized across…

Machine Learning · Computer Science · 2023-09-12 · Pengfei Guo, Warren Richard Morningstar, Raviteja Vemulapalli, Karan Singhal, Vishal M. Patel, Philip Andrew Mansfield

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes or even petabyte scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-17 · Davit Buniatyan

Privacy-Preserving Serverless Edge Learning with Decentralized Small Data

In the last decade, data-driven algorithms outperformed traditional optimization-based algorithms in many research areas, such as computer vision, natural language processing, etc. However, extensive data usage brings a new challenge or…

Machine Learning · Computer Science · 2021-12-02 · Shih-Chun Lin, Chia-Hung Lin

Throughput Prediction of Asynchronous SGD in TensorFlow

Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-03-02 · Zhuojin Li, Wumo Yan, Marco Paolieri, Leana Golubchik

An introduction to distributed training of deep neural networks for segmentation tasks with large seismic datasets

Deep learning applications are drastically progressing in seismic processing and interpretation tasks. However, the majority of approaches subsample data volumes and restrict model sizes to minimise computational requirements. Subsampling…

Geophysics · Physics · 2021-02-26 · Claire Birnie, Haithem Jarraya, Fredrik Hansteen

TensorTEE: Unifying Heterogeneous TEE Granularity for Efficient Secure Collaborative Tensor Computing

Heterogeneous collaborative computing with NPU and CPU has received widespread attention due to its substantial performance benefits. To ensure data confidentiality and integrity during computing, Trusted Execution Environments (TEE) is…

Cryptography and Security · Computer Science · 2024-07-15 · Husheng Han, Xinyao Zheng, Yuanbo Wen, Yifan Hao, Erhu Feng, Ling Liang, Jianan Mu, Xiaqing Li, Tianyun Ma, Pengwei Jin, Xinkai Song, Zidong Du, Qi Guo, Xing Hu

Accelerating Distributed Deep Learning using Lossless Homomorphic Compression

As deep neural networks (DNNs) grow in complexity and size, the resultant increase in communication overhead during distributed training has become a significant bottleneck, challenging the scalability of distributed training systems.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-13 · Haoyu Li, Yuchen Xu, Jiayi Chen, Rohit Dwivedula, Wenfei Wu, Keqiang He, Aditya Akella, Daehyeok Kim

Distributed learning of deep neural network over multiple agents

In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of…

Machine Learning · Computer Science · 2018-10-16 · Otkrist Gupta, Ramesh Raskar