Related papers: Maya: Optimizing Deep Learning Training Workloads …

EMA: Efficient Model Adaptation for Learning-based Systems

Machine learning (ML) is increasingly applied to optimize system performance in tasks such as resource management and network simulation. Unlike traditional ML tasks (e.g., image classification), networked systems often operate in…

Machine Learning · Computer Science 2026-05-15 Daiyang Yu, Xinyu Chen, Yihan Zhang, Yan Liang, Yaqi Qiao, Fan Lai

Characterizing Deep Learning Training Workloads on Alibaba-PAI

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using…

Performance · Computer Science 2019-10-15 Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Dynamic GPU Energy Optimization for Machine Learning Training Workloads

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-06 Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-12 Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

Hydra: A System for Large Multi-Model Deep Learning

Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion or even trillion parameter models in natural language processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-05 Kabir Nagrecha, Arun Kumar

Study of Efficiency in GPU Scalability for Artificial Intelligence Training (Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial)

Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…

Machine Learning · Computer Science 2025-09-04 David Cortes, Carlos Juiz, Belen Bermejo

PUMA: margin-based data pruning

Deep learning has been able to outperform humans in terms of classification accuracy in many tasks. However, to achieve robustness to adversarial perturbations, the best methodologies require to perform adversarial training on a much larger…

Machine Learning · Computer Science 2024-05-13 Javier Maroto, Pascal Frossard

$\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning

With the success of large-scale pre-trained models (PTMs), how efficiently adapting PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Although some parameter-efficient tuning…

Computation and Language · Computer Science 2023-01-10 Yitao Liu, Chenxin An, Xipeng Qiu

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyper-clusters": hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NV-Link and…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-16 Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, Nipun Kwatra

Scaling Performance of Large Language Model Pretraining

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-10 Alexander Interrante-Grant, Carla Varela-Rosa, Suhaas Narayan, Chris Connelly, Albert Reuther

Predicting Performance of Heterogeneous AI Systems with Discrete-Event Simulations

In recent years, artificial intelligence (AI) technologies have found industrial applications in various fields. AI systems typically possess complex software and heterogeneous CPU/GPU hardware architecture, making it difficult to answer…

Software Engineering · Computer Science 2022-04-08 Vyacheslav Zhdanovskiy, Lev Teplyakov, Anton Grigoryev

An Analysis of Collocation on GPUs for Deep Learning Training

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit…

Machine Learning · Computer Science 2023-04-25 Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün

A Study of Optimizations for Fine-tuning Large Language Models

Fine-tuning large language models is a popular choice among users trying to adapt them for specific applications. However, fine-tuning these models is a demanding task because the user has to examine several factors, such as resource…

Machine Learning · Computer Science 2024-06-07 Arjun Singh, Nikhil Pandey, Anup Shirgaonkar, Pavan Manoj, Vijay Aski

Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis

Natural language models are often summarized through a high-dimensional set of descriptive metrics including training corpus size, training time, the number of trainable parameters, inference times, and evaluation statistics that assess…

Computation and Language · Computer Science 2022-11-04 Zachary Zhou, Alisha Zachariah, Devin Conathan, Jeffery Kline

DEER: Deep Runahead for Instruction Prefetching on Modern Mobile Workloads

Mobile workloads incur heavy frontend stalls due to increasingly large code footprints as well as long repeat cycles. Existing instruction-prefetching techniques suffer from low coverage, poor timeliness, or high cost. We provide a SW/HW…

Performance · Computer Science 2025-04-30 Parmida Vahdatniya, Julian Humecki, Henry Kao, Tony Li, Ali Sedaghati, Fang Su, Ruoyu Zhou, Alex Bi, Reza Azimi, Maziar Goudarzi

Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization

Language models are now prevalent in software engineering with many developers using them to automate tasks and accelerate their development. While language models have been tremendous at accomplishing complex software engineering tasks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Daniel Nichols, Konstantinos Parasyris, Charles Jekel, Abhinav Bhatele, Harshitha Menon

Data Complexity-aware Deep Model Performance Forecasting

Deep learning models are widely used across computer vision and other domains. When working on the model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure…

Machine Learning · Computer Science 2026-01-06 Yen-Chia Chen, Hsing-Kuo Pao, Hanjuan Huang

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Alicia Golden, Michael Kuchnik, Samuel Hsia, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu