Related papers: Characterizing and Modeling Distributed Training w…

200 papers

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating…

Performance · Computer Science 2019-05-07 Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances,…

Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-17 Davit Buniatyan

In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications. This paper presents a comprehensive study of…

Machine Learning · Computer Science 2023-04-28 Neelesh Mungoli

Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One caveat however is the substantial amount of compute needed to train these deep learning models. Training a benchmark dataset like ImageNet on a…

Machine Learning · Computer Science 2018-10-30 Karanbir Chahal, Manraj Singh Grover, Kuntal Dey

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and…

Systems and Control · Electrical Eng. & Systems 2025-05-28 Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi, Qiang Wang, Xiaowen Chu

Data-driven methods for computer simulations are blooming in many scientific areas. The traditional approach to simulating physical behaviors relies on solving partial differential equations (PDE). Since calculating these iterative…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-01 Sergio Iserte, Alejandro González-Barberá, Paloma Barreda, Krzysztof Rojek

Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based…

Machine Learning · Computer Science 2019-01-14 Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, Alexander Schwing

Graph neural network (GNN) has been demonstrated to be a powerful model in many domains for its effectiveness in learning over graphs. To scale GNN training for large graphs, a widely adopted approach is distributed training which…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-19 Haiyang Lin, Mingyu Yan, Xiaocheng Yang, Mo Zou, Wenming Li, Xiaochun Ye, Dongrui Fan

We develop a scalable and extendable training framework that can utilize GPUs across nodes in a cluster and accelerate the training of deep learning models based on data parallelism. Both synchronous and asynchronous training are…

Machine Learning · Computer Science 2016-05-27 He Ma, Fei Mao, Graham W. Taylor

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Cloud computing provides a powerful yet low-cost environment for distributed deep learning workloads. However, training complex deep learning models often requires accessing large amounts of data, which can easily exceed the capacity of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-24 Nicholas Krichevsky, Renee St Louis, Tian Guo

We propose a distributed approach to train deep convolutional generative adversarial neural network (DC-CGANs) models. Our method reduces the imbalance between generator and discriminator by partitioning the training data according to data…

Computer Vision and Pattern Recognition · Computer Science 2021-04-30 Massimiliano Lupo Pasini, Vittorio Gabbi, Junqi Yin, Simona Perotto, Nouamane Laanait

Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-15 Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, the existing systems are not…

Machine Learning · Computer Science 2024-04-16 Youshao Xiao, Shangchun Zhao, Zhenglei Zhou, Zhaoxin Huan, Lin Ju, Xiaolu Zhang, Lin Wang, Jun Zhou

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 David McAllister, Matthew Tancik, Jiaming Song, Angjoo Kanazawa

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution…

Machine Learning · Computer Science 2019-12-03 Alexey Svyatkovskiy, Julian Kates-Harbeck, William Tang

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin-up large clusters for training ML models, it can also lead to ballooning costs. The 100s of virtual machine sizes provided by cloud platforms also makes it…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-06 Sahil Tyagi, Prateek Sharma

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…

Machine Learning · Computer Science 2020-03-13 Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, Carlee Joe-Wong