Related papers: Breaking MLPerf Training: A Case Study on Optimizi…

Boosting Distributed Training Performance of the Unpadded BERT Model

Pre-training models are an important tool in Natural Language Processing (NLP), while the BERT model is a classic pre-training model whose structure has been widely adopted by followers. It was even chosen as the reference model for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-18 Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, Yanjun Ma

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on…

Machine Learning · Computer Science 2021-07-01 Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amount of data, which…

Machine Learning · Computer Science 2020-09-21 Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li

Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup

Pre-trained language models, such as BERT, have achieved significant accuracy gain in many natural language processing tasks. Despite its effectiveness, the huge number of parameters makes training a BERT model computationally very…

Computation and Language · Computer Science 2020-11-30 Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, Jingqiao Zhang

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT…

Machine Learning · Computer Science 2022-05-24 Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (Study of the Efficiency of GPU Scalability for Artificial Intelligence Training)

Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…

Machine Learning · Computer Science 2025-09-04 David Cortes, Carlos Juiz, Belen Bermejo

Multi-node Bert-pretraining: Cost-efficient Approach

Recently, large scale Transformer-based language models such as BERT, GPT-2, and XLNet have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. One of the common trends in these recent…

Machine Learning · Computer Science 2020-08-04 Jiahuang Lin, Xin Li, Gennady Pekhimenko

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in…

Machine Learning · Computer Science 2020-01-06 Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

To train large models (like BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP network. On one side large batch-size optimization such as LAMB…

Machine Learning · Computer Science 2021-10-07 Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter…

Computation and Language · Computer Science 2023-08-08 Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science 2025-04-15 Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused…

Computation and Language · Computer Science 2019-11-15 Itzik Malkiel, Lior Wolf

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-23 Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, Yonggang Wen

Optimizing Distributed Training on Frontier for Large Language Models

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language…

Machine Learning · Computer Science 2025-08-29 Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-10 Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li

DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization

The growth of large language models (LLMs) increases challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Zhenheng Tang, Zichen Tang, Junlin Huang, Xinglin Pan, Rudan Yan, Yuxin Wang, Amelie Chi Zhou, Shaohuai Shi, Xiaowen Chu, Bo Li

MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when…

Machine Learning · Computer Science 2025-10-08 Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science 2024-07-23 Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik