Related papers: Distributed Deep Learning in Open Collaborations

Distributed Deep Learning Using Volunteer Computing-Like Paradigm

Use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis and speech recognition is increasing. When training DL models with large number of parameters and/or large datasets, cost and speed of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-05-28 · Medha Atre, Birendra Jha, Ashwini Rao

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science · 2022-07-04 · Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

A Survey From Distributed Machine Learning to Distributed Deep Learning

Artificial intelligence has made remarkable progress in handling complex tasks, thanks to advances in hardware acceleration and machine learning algorithms. However, to acquire more accurate outcomes and solve more complex issues,…

Machine Learning · Computer Science · 2023-09-12 · Mohammad Dehghani, Zahra Yazdanparast

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-04-10 · Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu

Machine Learning on Volatile Instances

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…

Machine Learning · Computer Science · 2020-03-13 · Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-13 · Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu

Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-09-26 · Ruben Mayer, Hans-Arno Jacobsen

Distributed Training of Deep Learning Models: A Taxonomic Perspective

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads in their chosen…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-09 · Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-15 · Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

Many recent breakthroughs in deep learning were achieved by training increasingly larger models on massive datasets. However, training such models can be prohibitively expensive. For instance, the cluster used to train GPT-3 costs over…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-07-02 · Max Ryabinin, Anton Gusev

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-04-28 · Jonathan Will, Jonathan Bader, Lauritz Thamsen

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-17 · Davit Buniatyan

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

More than 70% of cloud computing is paid for but sits idle. A large fraction of these idle compute are cheap CPUs with few cores that are not utilized during the less busy hours. This paper aims to enable those CPU cycles to train…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-02-01 · Minghao Yan, Nicholas Meisburger, Tharun Medini, Anshumali Shrivastava

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks

Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One caveat however is the substantial amount of compute needed to train these deep learning models. Training a benchmark dataset like ImageNet on a…

Machine Learning · Computer Science · 2018-10-30 · Karanbir Chahal, Manraj Singh Grover, Kuntal Dey

Combining Federated and Active Learning for Communication-efficient Distributed Failure Prediction in Aeronautics

Machine Learning has proven useful in the recent years as a way to achieve failure prediction for industrial systems. However, the high computational resources necessary to run learning algorithms are an obstacle to its widespread…

Artificial Intelligence · Computer Science · 2020-01-22 · Nicolas Aussel, Sophie Chabridon, Yohan Petetin

A Framework for Incentivized Collaborative Learning

Collaborations among various entities, such as companies, research labs, AI agents, and edge devices, have become increasingly crucial for achieving machine learning tasks that cannot be accomplished by a single entity alone. This is likely…

Machine Learning · Computer Science · 2023-05-29 · Xinran Wang, Qi Le, Ahmad Faraz Khan, Jie Ding, Ali Anwar

Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency

In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications. This paper presents a comprehensive study of…

Machine Learning · Computer Science · 2023-04-28 · Neelesh Mungoli

Distributed learning of deep neural network over multiple agents

In domains such as health care and finance, shortage of labeled data and computational resources is a critical issue while developing machine learning algorithms. To address the issue of labeled data scarcity in training and deployment of…

Machine Learning · Computer Science · 2018-10-16 · Otkrist Gupta, Ramesh Raskar

Decentralized adaptive clustering of deep nets is beneficial for client collaboration

We study the problem of training personalized deep learning models in a decentralized peer-to-peer setting, focusing on the setting where data distributions differ between the clients and where different clients have different local…

Machine Learning · Computer Science · 2022-11-01 · Edvin Listo Zec, Ebba Ekblom, Martin Willbo, Olof Mogren, Sarunas Girdzijauskas

Big Data Intelligence Using Distributed Deep Neural Networks

Large amounts of data are often required to train and deploy useful machine learning models in industry. Smaller enterprises do not have the luxury of accessing enough data for machine learning. For privacy-sensitive fields such as banking,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-05 · Felix Ongati, Eng. Lawrence Muchemi