Related papers: Improving Automatic Parallel Training via Balanced…

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…

Machine Learning · Computer Science 2022-11-28 Xupeng Miao , Yujie Wang , Youhe Jiang , Chunan Shi , Xiaonan Nie , Hailin Zhang , Bin Cui

Galvatron: Automatic Distributed Training for Large Transformer Models

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Xinyi Liu , Yujie Wang , Shenhan Zhu , Fangcheng Fu , Qingshuo Liu , Guangming Lin , Bin Cui

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal , Eiman Ebrahimi , Arslan Zulfiqar , Yaosheng Fu , Victor Zhang , Szymon Migacz , David Nellans , Puneet Gupta

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Anand Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , Matei Zaharia

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

Multimodal Large Language Models (MLLMs) have achieved remarkable advances by integrating text, image, and audio understanding within a unified architecture. However, existing distributed training frameworks remain fundamentally data-blind:…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-20 Hyeonjun An , Sihyun Kim , Chaerim Lim , Hyunjoon Kim , Rathijit Sen , Sangmin Jung , Hyeonsoo Lee , Dongwook Kim , Takki Yu , Jinkyu Jeong , Youngsok Kim , Kwanghyun Park

Systems for Parallel and Distributed Large-Model Deep Learning Training

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu , Jia Rao , Wei Chen

On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers

Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. The deployment of Transformer architectures is significantly hindered by their extensive computational and memory requirements,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-03 Zhengxian Lu , Fangyu Wang , Zhiwei Xu , Fei Yang , Tao Li

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang , Quanyu Zhu , Liangbei Xu , Zain Huda , Wang Zhou , Jin Fang , Dennis van der Staay , Yuxi Hu , Jade Nie , Jiyan Yang , Chunzhi Yang

Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-09 Siyu Wang , Yi Rong , Shiqing Fan , Zhen Zheng , LanSong Diao , Guoping Long , Jun Yang , Xiaoyong Liu , Wei Lin

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Yifan Niu , Han Xiao , Dongyi Liu , Wei Zhou , Jia Li

Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization

Hybrid parallelism techniques are essential for efficiently training large language models (LLMs). Nevertheless, current automatic parallel planning frameworks often overlook the simultaneous consideration of node heterogeneity and dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Ruilong Wu , Xinjiao Li , Yisu Wang , Xinyu Chen , Dirk Kutscher

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, the existing systems are not…

Machine Learning · Computer Science 2024-04-16 Youshao Xiao , Shangchun Zhao , Zhenglei Zhou , Zhaoxin Huan , Lin Ju , Xiaolu Zhang , Lin Wang , Jun Zhou

AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Jinfan Chen , Shigang Li , Ran Gun , Jinhui Yuan , Torsten Hoefler

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

Recently, Deep Neural Networks (DNNs) have recorded great success in handling medical and other complex classification tasks. However, as the sizes of a DNN model and the available dataset increase, the training process becomes more complex…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-08 Samson B. Akintoye , Liangxiu Han , Xin Zhang , Haoming Chen , Daoqiang Zhang

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan , Yi Rong , Chen Meng , Zongyan Cao , Siyu Wang , Zhen Zheng , Chuan Wu , Guoping Long , Jun Yang , Lixue Xia , Lansong Diao , Xiaoyong Liu , Wei Lin

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

A Comparative Analysis of Distributed Training Strategies for GPT-2

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ishan Patwardhan , Shubham Gandhi , Om Khare , Amit Joshi , Suraj Sawant