Related papers: Galvatron: Efficient Transformer Training over Mul…

200 papers

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently…

Machine Learning · Computer Science 2024-09-06 Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-21 Jun-Liang Lin, Kamesh Madduri, Mahmut Taylan Kandemir

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many of such large models to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-27 Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant

Training deep neural networks on large scientific data is a challenging task that requires enormous compute power, especially if no pre-trained models exist to initialize the process. We present a novel tournament method to train…

Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-22 Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu, Jia Rao, Wei Chen

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen, Chia-Lin Yang, Hsiang-Yun Cheng

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g.,…

Machine Learning · Computer Science 2018-06-12 Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional…

Machine Learning · Computer Science 2024-05-15 Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele

The success of Transformer models has pushed the deep learning model scale to billions of parameters, beyond the limited memory of a single GPU. However, the best practice for choosing the optimal parallel strategy is still…

Machine Learning · Computer Science 2023-10-06 Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You

With the increasing number of Machine and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure represents a requirement for fast and efficient R&D. This work explores different types of cloud services…

Machine Learning · Computer Science 2021-11-09 Renato Cardoso, Dejan Golubovic, Ignacio Peluaga Lozada, Ricardo Rocha, João Fernandes, Sofia Vallecorsa

In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g., during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users…

Machine Learning · Computer Science 2023-11-07 Kabir Nagrecha, Arun Kumar