Related papers: HetSeq: Distributed GPU Training on Heterogeneous …

200 papers

Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and long duration. It is challenging because typical data like sentences have variable lengths, and…

Computation and Language · Computer Science 2022-06-17 Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li

Long-context training of large language models (LLMs) is commonly distributed with Context Parallelism (CP) and Head Parallelism (HP), but existing training systems largely assume homogeneous GPU meshes. This paper extends CP and HP to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Yan Liang, Youhe Jiang, Ran Yan, Binhang Yuan, Wei Wang, Chuan Wu

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs. It is a challenge to build a large-scale cluster with one type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-12 Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and…

Computation and Language · Computer Science 2019-04-03 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose…

Mathematical Software · Computer Science 2021-04-23 Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-28 Dimitar Mileski, Nikola Petrovski, Marjan Gusev

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Kashish Mittal, Di Yu, Roozbeh Ketabi, Arushi Arora, Brendon Lapp, Peng Zhang

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, Binhang Yuan

Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-15 Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee

As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Yongjun He, Shuai Zhang, Jiading Gai, Xiyuan Zhang, Boran Han, Bernie Wang, Huzefa Rangwala, George Karypis

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and…

Systems and Control · Electrical Eng. & Systems 2025-05-28 Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao

State-of-the-art deep learning systems such as TensorFlow and PyTorch tightly couple the model with the underlying hardware. This coupling requires the user to modify application logic in order to run the same job across a different set of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-13 Andrew Or, Haoyu Zhang, Michael J. Freedman

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-02 Heehoon Kim, Jaehwan Lee, Taejeoung Kim, Jongwon Park, Jinpyo Kim, Pyongwon Suh, Ryan H. Choi, Sangwoo Lee, Jaejin Lee

The widely-adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not employ effectively the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-21 Yujing Ma, Florin Rusu

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the…

Machine Learning · Computer Science 2018-02-22 Alexander Sergeev, Mike Del Balso

With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization…

Machine Learning · Computer Science 2023-12-07 Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson

Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent…

Machine Learning · Computer Science 2025-03-25 Sahil Tyagi, Prateek Sharma

We introduce pyGSL, a Python library that provides efficient implementations of state-of-the-art graph structure learning models along with diverse datasets to evaluate them on. The implementations are written in GPU-friendly ways, allowing…

Machine Learning · Computer Science 2022-11-08 Max Wasserman, Gonzalo Mateos

As giant dense models advance quality but require large amounts of GPU budgets for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional computation architecture, is proposed to scale models while keeping their…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-18 Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, Bin Cui