Related papers: Megatron-LM: Training Multi-Billion Parameter Lang…

200 papers

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings…

The advent of the Transformer architecture has propelled the growth of natural language processing (NLP) models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware like expansive GPU memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-18 Xiaofeng Wu, Jia Rao, Wei Chen

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success,…

Large-scale transformer models have shown remarkable performance in language modelling tasks. However, such models feature billions of parameters, leading to difficulties in their deployment and prohibitive training costs from scratch. To…

Artificial Intelligence · Computer Science 2023-06-06 Viktoriia Chekalina, Georgii Novikov, Julia Gusak, Ivan Oseledets, Alexander Panchenko

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant

Large-scale Transformer models have significantly promoted the recent development of natural language processing applications. However, little effort has been made to unify the effective models. In this paper, driven by providing a new set…

Computation and Language · Computer Science 2022-04-12 Dezhou Shen

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism…

Machine Learning · Computer Science 2021-09-29 Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

Large foundation language models have shown their versatility in being able to be adapted to perform a wide variety of downstream tasks, such as text generation, sentiment analysis, semantic search etc. However, training such large…

Machine Learning · Computer Science 2023-04-13 Venkat Srinivasan, Darshan Gandhi, Urmish Thakker, Raghu Prabhakar

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources, e.g. a one-trillion-parameter…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-11 Ajay Navilarekal Rajgopal, Nikolai Solmsdorf

Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are a better choice to pretrain (on user-specified datasets) by following the scaling laws…

Machine Learning · Computer Science 2026-03-23 Praveen Rao

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation.…

Machine Learning · Computer Science 2022-05-12 Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language…

Computation and Language · Computer Science 2019-09-17 Qian Yang, Zhouyuan Huo, Wenlin Wang, Heng Huang, Lawrence Carin

Large language models (LLMs) are computationally intensive. The computation workload and the memory footprint grow quadratically with the dimension (layer width). Most of LLMs' parameters come from the linear layers of the transformer…

Machine Learning · Computer Science 2024-02-22 Xiao-Yang Liu, Jie Zhang, Guoxuan Wang, Weiqing Tong, Anwar Walid

In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources and most of the models are trained from…

Computation and Language · Computer Science 2021-10-15 Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, Qun Liu

Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-08 Esmail Gumaan

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-10 Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther

Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-30 Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov