Related papers: Galvatron: Efficient Transformer Training over Mul…

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently…

Machine Learning · Computer Science 2024-09-06 Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui

Galvatron: Automatic Distributed Training for Large Transformer Models

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

Systems for Parallel and Distributed Large-Model Deep Learning Training

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs

Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-21 Jun-Liang Lin, Kamesh Madduri, Mahmut Taylan Kandemir

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta

Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many of such large models to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-27 Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel

A Comparative Analysis of Distributed Training Strategies for GPT-2

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant

Parallelizing Training of Deep Generative Models on Massive Scientific Datasets

Training deep neural networks on large scientific data is a challenging task that requires enormous compute power, especially if no pre-trained models exist to initialize the process. We present a novel tournament method to train…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-08 Sam Ade Jacobs, Brian Van Essen, David Hysom, Jae-Seung Yeom, Tim Moon, Rushil Anirudh, Jayaraman J. Thiagaranjan, Shusen Liu, Peer-Timo Bremer, Jim Gaffney, Tom Benson, Peter Robinson, Luc Peterson, Brian Spears

Decentralized Training of Foundation Models in Heterogeneous Environments

Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-22 Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang

ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu, Jia Rao, Wei Chen

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Training a Deep Neural Network (DNN) is compute-intensive, often taking days to weeks for a single model. Parallel execution of DNN training on GPUs is therefore a widely adopted approach to speed up the process.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g.,…

Machine Learning · Computer Science 2018-06-12 Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional…

Machine Learning · Computer Science 2024-05-15 Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

The success of Transformer models has pushed the deep learning model scale to billions of parameters, beyond the limited memory of a single GPU. However, the best practice for choosing the optimal parallel strategy is still…

Machine Learning · Computer Science 2023-10-06 Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You

Accelerating GAN training using highly parallel hardware on public cloud

With the increasing number of Machine and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure represents a requirement for fast and efficient R&D. This work explores different types of cloud services…

Machine Learning · Computer Science 2021-11-09 Renato Cardoso, Dejan Golubovic, Ignacio Peluaga Lozada, Ricardo Rocha, João Fernandes, Sofia Vallecorsa

Saturn: Efficient Multi-Large-Model Deep Learning

In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users…

Machine Learning · Computer Science 2023-11-07 Kabir Nagrecha, Arun Kumar