Related papers: Improving Automatic Parallel Training via Balanced…

200 papers

Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…

Machine Learning · Computer Science 2022-11-28 Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui

AI accelerator processing capabilities and memory constraints largely dictate the scale at which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state-of-the-art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used…

Machine Learning · Computer Science 2022-11-08 Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, Puneet Gupta

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Multimodal Large Language Models (MLLMs) have achieved remarkable advances by integrating text, image, and audio understanding within a unified architecture. However, existing distributed training frameworks remain fundamentally data-blind:…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-20 Hyeonjun An, Sihyun Kim, Chaerim Lim, Hyunjoon Kim, Rathijit Sen, Sangmin Jung, Hyeonsoo Lee, Dongwook Kim, Takki Yu, Jinkyu Jeong, Youngsok Kim, Kwanghyun Park

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu, Jia Rao, Wei Chen

Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. The deployment of Transformer architectures is significantly hindered by their extensive computational and memory requirements,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-03 Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-09 Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li

Hybrid parallelism techniques are essential for efficiently training large language models (LLMs). Nevertheless, current automatic parallel planning frameworks often overlook the simultaneous consideration of node heterogeneity and dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Ruilong Wu, Xinjiao Li, Yisu Wang, Xinyu Chen, Dirk Kutscher

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, the existing systems are not…

Machine Learning · Computer Science 2024-04-16 Youshao Xiao, Shangchun Zhao, Zhenglei Zhou, Zhaoxin Huan, Lin Ju, Xiaolu Zhang, Lin Wang, Jun Zhou

Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-22 Jinfan Chen, Shigang Li, Ran Gun, Jinhui Yuan, Torsten Hoefler

Recently, Deep Neural Networks (DNNs) have recorded great success in handling medical and other complex classification tasks. However, as the sizes of a DNN model and the available dataset increase, the training process becomes more complex…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-08 Samson B. Akintoye, Liangxiu Han, Xin Zhang, Haoming Chen, Daoqiang Zhang

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant