Related papers: Optimizing Distributed Training on Frontier for La…

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science · 2021-08-25 · Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers

Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid…

Machine Learning · Computer Science · 2025-02-13 · Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-30 · Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-11 · Ajay Navilarekal Rajgopal, Nikolai Solmsdorf

Scaling Performance of Large Language Model Pretraining

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI)…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-10 · Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science · 2025-04-15 · Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-22 · Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-10-12 · Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are a better choice to pretrain (on user-specified datasets) by following the scaling laws…

Machine Learning · Computer Science · 2026-03-23 · Praveen Rao

Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Scaling up Large Language Model (LLM) training involves fitting a tremendous amount of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-05 · Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science · 2024-07-23 · Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

Comparative Study of Large Language Model Architectures on Frontier

Large language models (LLMs) have garnered significant attention in both the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-02 · Junqi Yin, Avishek Bose, Guojing Cong, Isaac Lyngaas, Quentin Anthony

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings…

Machine Learning · Computer Science · 2024-02-27 · Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory…

Computation and Language · Computer Science · 2020-03-17 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with their sophisticated capabilities. Training these models requires vast GPU clusters and significant computing time, posing major challenges in terms of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-30 · Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun

FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing

Large language models (LLMs) are computationally intensive. The computation workload and the memory footprint grow quadratically with the dimension (layer width). Most of LLMs' parameters come from the linear layers of the transformer…

Machine Learning · Computer Science · 2024-02-22 · Xiao-Yang Liu, Jie Zhang, Guoxuan Wang, Weiqing Tong, Anwar Walid

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and…

Performance · Computer Science · 2023-12-04 · Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu

A Comparative Analysis of Distributed Training Strategies for GPT-2

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-27 · Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant

Characterizing Communication Patterns in Distributed Large Language Model Inference

Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-22 · Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda

Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

The scaling law for large language models (LLMs) depicts that the path towards machine intelligence necessitates training at large scale. Thus, companies continuously build large-scale GPU clusters, and launch training jobs that span over…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-22 · Guoliang He, Youhe Jiang, Wencong Xiao, Kaihua Jiang, Shuguang Wang, Jun Wang, Zixian Du, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, Eiko Yoneki