Related papers: Distributed Deep Learning Using Volunteer Computin…

200 papers

Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and…

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…

Machine Learning · Computer Science 2020-03-13 Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong

Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-15 Yoochan Kim, Kihyun Kim, Yonghyeon Cho, Jinwoo Kim, Awais Khan, Ki-Dong Kang, Baik-Song An, Myung-Hoon Cha, Hong-Yeon Kim, Youngjae Kim

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training ML models, it can also lead to ballooning costs. The hundreds of virtual machine sizes provided by cloud platforms also make it…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-06 Sahil Tyagi, Prateek Sharma

Distributed deep learning (DL) has become prevalent in recent years to reduce training time by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and datasets. However, system scalability is limited by…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu

Deep learning has become an indispensable part of life, in applications such as face recognition and NLP, but training deep models has always been a challenge, and in recent years the complexity of training data and models has shown explosive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-18 Sheng Huang

Originally designed for applications in computer graphics, visual computing (VC) methods synthesize information about physical and virtual worlds, using prescribed algorithms optimized for spatial computing. VC is used to analyze geometry,…

We design and implement a distributed multinode synchronous SGD algorithm without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-02 Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although the multi-tenant…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-19 Zihan Liu, Jingwen Leng, Zhihui Zhang, Quan Chen, Chao Li, Minyi Guo

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Jinfan Chen, Shigang Li, Ran Gun, Jinhui Yuan, Torsten Hoefler

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization (to amortize their steep cost) is a challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-15 Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS are required to make many decisions to process their particular workloads in their chosen…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-09 Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue

Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-26 Ruben Mayer, Hans-Arno Jacobsen

Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource demanding and they experience great benefits when exploiting AI…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-12 Federica Filippini, Danilo Ardagna, Marco Lattuada, Edoardo Amaldi, Michele Ciavotta, Maciek Riedl, Katarzyna Materka, Paweł Skrzypek, Fabrizio Magugliani, Marco Cicala

In recent years, data are typically distributed in multiple organizations while the data security is becoming increasingly important. Federated Learning (FL), which enables multiple parties to collaboratively train a model without…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-13 Ji Liu, Xuehai Zhou, Lei Mo, Shilei Ji, Yuan Liao, Zheng Li, Qin Gu, Dejing Dou

In the era of deep learning (DL), convolutional neural networks (CNNs), and large language models (LLMs), machine learning (ML) models are becoming increasingly complex, demanding significant computational resources for both inference and…

Machine Learning · Computer Science 2024-05-27 Madison Threadgill, Andreas Gerstlauer

Despite the notable success of deep neural networks (DNNs) in solving complex tasks, the training process still poses considerable challenges. A primary obstacle is the substantial time required for training, particularly as high…

Machine Learning · Computer Science 2025-09-09 Viet Hoang Pham, Hyo-Sung Ahn

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as the computing power grows much faster than network capacity, network communication has gradually…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-11 Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao