Related papers: HetSeq: Distributed GPU Training on Heterogeneous …

LightSeq2: Accelerated Training for Transformer-based Models on GPUs

Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and long duration. It is challenging because typical data like sentences have variable lengths, and…

Computation and Language · Computer Science 2022-06-17 Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

Long-context training of large language models (LLMs) is commonly distributed with Context Parallelism (CP) and Head Parallelism (HP), but existing training systems largely assume homogeneous GPU meshes. This paper extends CP and HP to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Yan Liang, Youhe Jiang, Ran Yan, Binhang Yuan, Wei Wang, Chuan Wu

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs. It is a challenge to build a large-scale cluster with one type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-12 Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and…

Computation and Language · Computer Science 2019-04-03 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

LightSeq: A High Performance Inference Library for Transformers

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose…

Mathematical Software · Computer Science 2021-04-23 Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-28 Dimitar Mileski, Nikola Petrovski, Marjan Gusev

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Kashish Mittal, Di Yu, Roozbeh Ketabi, Arushi Arora, Brendon Lapp, Peng Zhang

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, Binhang Yuan

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models

Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-15 Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Yongjun He, Shuai Zhang, Jiading Gai, Xiyuan Zhang, Boran Han, Bernie Wang, Huzefa Rangwala, George Karypis

DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and…

Systems and Control · Electrical Eng. & Systems 2025-05-28 Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao

VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware

State-of-the-art deep learning systems such as TensorFlow and PyTorch tightly couple the model with the underlying hardware. This coupling requires the user to modify application logic in order to run the same job across a different set of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-13 Andrew Or, Haoyu Zhang, Michael J. Freedman

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

HetCCL: Accelerating LLM Training with Heterogeneous GPUs

The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-02 Heehoon Kim, Jaehwan Lee, Taejeoung Kim, Jongwon Park, Jinpyo Kim, Pyongwon Suh, Ryan H. Choi, Sangwoo Lee, Jaejin Lee

Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms

The widely-adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not employ effectively the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-21 Yujing Ma, Florin Rusu

Horovod: fast and easy distributed deep learning in TensorFlow

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the…

Machine Learning · Computer Science 2018-02-22 Alexander Sergeev, Mike Del Balso

FlexModel: A Framework for Interpretability of Distributed Large Language Models

With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization…

Machine Learning · Computer Science 2023-12-07 Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters

Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent…

Machine Learning · Computer Science 2025-03-25 Sahil Tyagi, Prateek Sharma

pyGSL: A Graph Structure Learning Toolkit

We introduce pyGSL, a Python library that provides efficient implementations of state-of-the-art graph structure learning models along with diverse datasets to evaluate them on. The implementations are written in GPU-friendly ways, allowing…

Machine Learning · Computer Science 2022-11-08 Max Wasserman, Gonzalo Mateos

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

As giant dense models advance quality but require large amounts of GPU budgets for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional computation architecture, is proposed to scale models while keeping their…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-18 Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, Bin Cui