Related papers: Breaking MLPerf Training: A Case Study on Optimizi…

200 papers

Pre-training models are an important tool in Natural Language Processing (NLP), while the BERT model is a classic pre-training model whose structure has been widely adopted by followers. It was even chosen as the reference model for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-18 Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, Yanjun Ma

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on…

Machine Learning · Computer Science 2021-07-01 Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amount of data, which…

Machine Learning · Computer Science 2020-09-21 Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li

Pre-trained language models, such as BERT, have achieved significant accuracy gain in many natural language processing tasks. Despite its effectiveness, the huge number of parameters makes training a BERT model computationally very…

Computation and Language · Computer Science 2020-11-30 Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, Jingqiao Zhang

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT…

Machine Learning · Computer Science 2022-05-24 Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…

Machine Learning · Computer Science 2025-09-04 David Cortes, Carlos Juiz, Belen Bermejo

Recently, large scale Transformer-based language models such as BERT, GPT-2, and XLNet have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. One of the common trends in these recent…

Machine Learning · Computer Science 2020-08-04 Jiahuang Lin, Xin Li, Gennady Pekhimenko

Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in…

To train large models (like BERT and GPT-3) on hundreds of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP network. On one side large batch-size optimization such as LAMB…

Machine Learning · Computer Science 2021-10-07 Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He

Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter…

Computation and Language · Computer Science 2023-08-08 Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science 2025-04-15 Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused…

Computation and Language · Computer Science 2019-11-15 Itzik Malkiel, Lior Wolf

It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-23 Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, Yonggang Wen

Large language models (LLMs) have demonstrated remarkable success as foundational models, benefiting various downstream applications through fine-tuning. Recent studies on loss scaling have demonstrated the superior performance of larger…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-25 Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language…

Machine Learning · Computer Science 2025-08-29 Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-10 Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li

The growth of large language models (LLMs) increases challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Zhenheng Tang, Zichen Tang, Junlin Huang, Xinglin Pan, Rudan Yan, Yuxin Wang, Amelie Chi Zhou, Shaohuai Shi, Xiaowen Chu, Bo Li

Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when…

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science 2024-07-23 Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik