Related papers: Phantora: Maximizing Code Reuse in Simulation-base…

Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads

Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally,…

Hardware Architecture · Computer Science 2026-01-30 Rebecca Pelke, Nils Bosbach, Lennart M. Reimann, Rainer Leupers

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

Predicting the performance of large-scale distributed machine learning (ML) workloads across multiple accelerator architectures remains a central challenge in ML system design. Existing GPU and TPU focused simulators are typically…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Jonas Svedas, Nathan Laubeuf, Ryan Harvey, Arjun Singh, Changhai Man, Abubakr Nada, Tushar Krishna, James Myers, Debjyoti Bhattacharjee

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-22 Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

Optimizing Data Collection in Deep Reinforcement Learning

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU…

Machine Learning · Computer Science 2022-07-19 James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing

Data loaders are used by Machine Learning (ML) frameworks like PyTorch and TensorFlow to apply transformations to data before feeding it into the accelerator. This operation is called data preprocessing. Data preprocessing plays an…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-29 Rahma Nouaji, Stella Bitchebe, Ricardo Macedo, Oana Balmau

Simulation-Based Performance Prediction of HPC Applications: A Case Study of HPL

We propose a simulation-based approach for performance modeling of parallel applications on high-performance computing platforms. Our approach enables full-system performance modeling: (1) the hardware platform is represented by an abstract…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-06 Gen Xu, Huda Ibeid, Xin Jiang, Vjekoslav Svilan, Zhaojuan Bian

rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA

Implementing Machine Learning (ML) models on Field-Programmable Gate Arrays (FPGAs) is becoming increasingly popular across various domains as a low-latency and low-power solution that helps manage large data rates generated by continuously…

Machine Learning · Computer Science 2024-08-13 Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien

HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity

In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of…

Networking and Internet Architecture · Computer Science 2026-03-20 Wenyi Wang, Zheng Wu, Yanmeng Wang, Haolin Mao, Lei Han, Gaogang Xie, Fu Xiao

Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb

In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing…

High Energy Physics - Experiment · Physics 2025-05-01 Fotis I. Giasemis, Vladimir Lončar, Bertrand Granado, Vladimir Vava Gligorov

SimNet: Accurate and High-Performance Computer Architecture Simulation using Deep Learning

While discrete-event simulators are essential tools for architecture research, design, and development, their practicality is limited by an extremely long time-to-solution for realistic applications under investigation. This work describes…

Hardware Architecture · Computer Science 2022-04-07 Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, Adolfy Hoisie

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-18 Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters

Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-20 Rutwik Jain, Brandon Tran, Keting Chen, Matthew D. Sinclair, Shivaram Venkataraman

From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

NSML: A Machine Learning Platform That Enables You to Focus on Your Models

Machine learning libraries such as TensorFlow and PyTorch simplify model implementation. However, researchers are still required to perform a non-trivial amount of manual tasks such as GPU allocation, training status tracking, and…

Machine Learning · Computer Science 2017-12-19 Nako Sung, Minkyu Kim, Hyunwoo Jo, Youngil Yang, Jingwoong Kim, Leonard Lausen, Youngkwan Kim, Gayoung Lee, Donghyun Kwak, Jung-Woo Ha, Sunghun Kim

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou

Analysis of Hardware Synthesis Strategies for Machine Learning in Collider Trigger and Data Acquisition

To fully exploit the physics potential of current and future high energy particle colliders, machine learning (ML) can be implemented in detector electronics for intelligent data processing and acquisition. The implementation of ML in…

Instrumentation and Detectors · Physics 2024-11-19 Haoyi Jia, Abhilasha Dave, Julia Gonski, Ryan Herbst

LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging…

Machine Learning · Computer Science 2025-06-03 Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, Rui Hou

Analyzing Machine Learning Performance in a Hybrid Quantum Computing and HPC Environment

We explored the possible benefits of integrating quantum simulators in a "hybrid" quantum machine learning (QML) workflow that uses both classical and quantum computations in a high-performance computing (HPC) environment. Here, we used two…

Emerging Technologies · Computer Science 2024-07-11 Samuel T. Bieberich, Michael A. Sandoval