Related papers: Hyper: Distributed Cloud Processing for Large-Scal…

Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency

In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications. This paper presents a comprehensive study of…

Machine Learning · Computer Science 2023-04-28 Neelesh Mungoli

Towards Distributed Petascale Computing

In this chapter we will argue that studying such multi-scale multi-science systems gives rise to inherently hybrid models containing many different algorithms best serviced by different types of computing environments (ranging from…

Astrophysics · Physics 2007-05-23 A. G. Hoekstra, S. F. Portegies Zwart, M. Bubak, P. M. A. Sloot

HTC Scientific Computing in a Distributed Cloud Environment

This paper describes the use of a distributed cloud computing system for high-throughput computing (HTC) scientific applications. The distributed cloud computing system is composed of a number of separate Infrastructure-as-a-Service (IaaS)…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-02-11 R. Sobie, A. Agarwal, I. Gable, C. Leavett-Brown, M. Paterson, R. Taylor, A. Charbonneau, R. Impey, W. Podiama

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-21 Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, Xiaowen Chu

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-28 Dimitar Mileski, Nikola Petrovski, Marjan Gusev

Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments

As the demand grows for scalable and privacy-aware AI systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At the same time, the combination of high-performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-26 Sangam Ghimire, Paribartan Timalsina, Nirjal Bhurtel, Bishal Neupane, Bigyan Byanju Shrestha, Subarna Bhattarai, Prajwal Gaire, Jessica Thapa, Sudan Jha

Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration, e.g., server type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-08 Shijian Li, Robert J. Walls, Tian Guo

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphical processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-15 Jose González-Abad, Álvaro López García, Valentin Y. Kozlov

A Privacy-Preserving Cloud Architecture for Distributed Machine Learning at Scale

Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture…

Machine Learning · Computer Science 2025-12-13 Vinoth Punniyamoorthy, Ashok Gadi Parthi, Mayilsamy Palanigounder, Ravi Kiran Kodali, Bikesh Kumar, Kabilan Kannan

Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

Distributed computing platforms provide a robust mechanism to perform large-scale computations by splitting the task and data among multiple locations, possibly located thousands of miles apart geographically. Although such distribution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-24 Alok Singh, Eric Stephan, Malachi Schram, Ilkay Altintas

Incremental Learning Framework Using Cloud Computing

High volume of data, perceived as either challenge or opportunity. Deep learning architecture demands high volume of data to effectively back propagate and train the weights without bias. At the same time, large volume of data demands…

Machine Learning · Statistics 2018-05-15 Kumarjit Pathak, Prabhukiran G, Jitin Kapila, Nikit Gawande

ChainerMN: Scalable Distributed Deep Learning Framework

One of the keys for deep learning to have made a breakthrough in various fields was to utilize high computing powers centering around GPUs. Enabling the use of further computing abilities by distributed processing is essential not only to…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-01 Takuya Akiba, Keisuke Fukuda, Shuji Suzuki

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava, Santanu Chaudhury, Dr. Jayadeva

Decentralized Diffusion Models

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 David McAllister, Matthew Tancik, Jiaming Song, Angjoo Kanazawa

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers

Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL)…

Machine Learning · Computer Science 2026-03-23 Yijiang Li, Zilinghan Li, Kyle Chard, Ian Foster, Todd Munson, Ravi Madduri, Kibaek Kim

GraphLab: A Distributed Framework for Machine Learning in the Cloud

Machine Learning (ML) techniques are indispensable in a wide range of fields. Unfortunately, the exponential increase of dataset sizes are rapidly extending the runtime of sequential algorithms and threatening to slow future progress in ML.…

Machine Learning · Computer Science 2011-07-06 Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi, Qiang Wang, Xiaowen Chu

Deep Transfer Hashing for Adaptive Learning on Federated Streaming Data

This extended abstract explores the integration of federated learning with deep transfer hashing for distributed prediction tasks, emphasizing resource-efficient client training from evolving data streams. Federated learning allows multiple…

Machine Learning · Computer Science 2024-09-20 Manuel Röder, Frank-Michael Schleif