Related papers: Optimizing transformer-based machine translation m…

Exploring Hyper-Parameter Optimization for Neural Machine Translation on GPU Architectures

Neural machine translation (NMT) has been accelerated by deep learning neural networks over statistical-based approaches, due to the plethora and programmability of commodity heterogeneous computing architectures such as FPGAs and GPUs and…

Computation and Language · Computer Science 2021-09-15 Robert Lim , Kenneth Heafield , Hieu Hoang , Mark Briers , Allen Malony

Profiling and optimization of multi-card GPU machine learning jobs

The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-30 Marcin Lawenda , Kyrylo Khloponin , Krzesimir Samborski , Łukasz Szustak

Vector quantization loss analysis in VQGANs: a single-GPU ablation study for image-to-image synthesis

This study performs an ablation analysis of Vector Quantized Generative Adversarial Networks (VQGANs), concentrating on image-to-image synthesis utilizing a single NVIDIA A100 GPU. The current work explores the nuanced effects of varying…

Computer Vision and Pattern Recognition · Computer Science 2023-08-11 Luv Verma , Varun Mohan

Cramming: Training a Language Model on a Single GPU in One Day

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the…

Computation and Language · Computer Science 2022-12-29 Jonas Geiping , Tom Goldstein

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go , Joongun Park , Spandan More , Hanjiang Wu , Irene Wang , Aaron Jezghani , Tushar Krishna , Divya Mahajan

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models(LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Biyao Zhang , Mingkai Zheng , Debargha Ganguly , Xuecen Zhang , Vikash Singh , Vipin Chaudhary , Zhao Zhang

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better…

Computation and Language · Computer Science 2023-05-11 Ye Lin , Shuhan Zhou , Yanyang Li , Anxiang Ma , Tong Xiao , Jingbo Zhu

Multilingual Machine Translation with Hyper-Adapters

Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to…

Computation and Language · Computer Science 2022-12-06 Christos Baziotis , Mikel Artetxe , James Cross , Shruti Bhosale

Massive Exploration of Neural Machine Translation Architectures

Neural Machine Translation (NMT) has shown remarkable progress over the past few years with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically…

Computation and Language · Computer Science 2017-03-23 Denny Britz , Anna Goldie , Minh-Thang Luong , Quoc Le

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Anand Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , Matei Zaharia

Training Tips for the Transformer Model

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some of the critical parameters that affect the…

Computation and Language · Computer Science 2018-05-03 Martin Popel , Ondřej Bojar

Can pruning make Large Language Models more efficient?

Transformer models have revolutionized natural language processing with their unparalleled ability to grasp complex contextual relationships. However, the vast number of parameters in these models has raised concerns regarding computational…

Machine Learning · Computer Science 2023-10-10 Sia Gholami , Marwan Omar

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-18 Avinash Maurya , Jie Ye , M. Mustafa Rafique , Franck Cappello , Bogdan Nicolae

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…

Machine Learning · Computer Science 2022-11-28 Xupeng Miao , Yujie Wang , Youhe Jiang , Chunan Shi , Xiaonan Nie , Hailin Zhang , Bin Cui

Exploring the Limits of Large Scale Pre-training

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work,…

Machine Learning · Computer Science 2021-10-06 Samira Abnar , Mostafa Dehghani , Behnam Neyshabur , Hanie Sedghi

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…

Computation and Language · Computer Science 2024-12-10 Clara Na , Ian Magnusson , Ananya Harsh Jha , Tom Sherborne , Emma Strubell , Jesse Dodge , Pradeep Dasigi

Understanding Training Efficiency of Deep Learning Recommendation Models at Scale

The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of…

Hardware Architecture · Computer Science 2020-11-12 Bilge Acun , Matthew Murphy , Xiaodong Wang , Jade Nie , Carole-Jean Wu , Kim Hazelwood

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host…

Computation and Language · Computer Science 2026-04-08 Zhengqing Yuan , Hanchi Sun , Lichao Sun , Yanfang Ye

On the Sparsity of Neural Machine Translation Models

Modern neural machine translation (NMT) models employ a large number of parameters, which leads to serious over-parameterization and typically causes the underutilization of computational resources. In response to this problem, we…

Computation and Language · Computer Science 2020-10-07 Yong Wang , Longyue Wang , Victor O. K. Li , Zhaopeng Tu

Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference

Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the…

Machine Learning · Computer Science 2025-07-08 Jonathan Wenger , Kaiwen Wu , Philipp Hennig , Jacob R. Gardner , Geoff Pleiss , John P. Cunningham