Related papers: Phantora: Maximizing Code Reuse in Simulation-base…

200 papers

Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally,…

Hardware Architecture · Computer Science · 2026-01-30 · Rebecca Pelke, Nils Bosbach, Lennart M. Reimann, Rainer Leupers

Predicting the performance of large-scale distributed machine learning (ML) workloads across multiple accelerator architectures remains a central challenge in ML system design. Existing GPU and TPU focused simulators are typically…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-15 · Jonas Svedas, Nathan Laubeuf, Ryan Harvey, Arjun Singh, Changhai Man, Abubakr Nada, Tushar Krishna, James Myers, Debjyoti Bhattacharjee

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-22 · Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU…

Machine Learning · Computer Science · 2022-07-19 · James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

Data loaders are used by Machine Learning (ML) frameworks like PyTorch and TensorFlow to apply transformations to data before feeding it into the accelerator. This operation is called data preprocessing. Data preprocessing plays an…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-29 · Rahma Nouaji, Stella Bitchebe, Ricardo Macedo, Oana Balmau

We propose a simulation-based approach for performance modeling of parallel applications on high-performance computing platforms. Our approach enables full-system performance modeling: (1) the hardware platform is represented by an abstract…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-11-06 · Gen Xu, Huda Ibeid, Xin Jiang, Vjekoslav Svilan, Zhaojuan Bian

Implementing Machine Learning (ML) models on Field-Programmable Gate Arrays (FPGAs) is becoming increasingly popular across various domains as a low-latency and low-power solution that helps manage large data rates generated by continuously…

Machine Learning · Computer Science · 2024-08-13 · Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien

In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of…

Networking and Internet Architecture · Computer Science · 2026-03-20 · Wenyi Wang, Zheng Wu, Yanmeng Wang, Haolin Mao, Lei Han, Gaogang Xie, Fu Xiao

In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing…

High Energy Physics - Experiment · Physics · 2025-05-01 · Fotis I. Giasemis, Vladimir Lončar, Bertrand Granado, Vladimir Vava Gligorov

While discrete-event simulators are essential tools for architecture research, design, and development, their practicality is limited by an extremely long time-to-solution for realistic applications under investigation. This work describes…

Hardware Architecture · Computer Science · 2022-04-07 · Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, Adolfy Hoisie

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-18 · Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-29 · Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-20 · Rutwik Jain, Brandon Tran, Keting Chen, Matthew D. Sinclair, Shivaram Venkataraman

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science · 2025-10-08 · Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

Machine learning libraries such as TensorFlow and PyTorch simplify model implementation. However, researchers are still required to perform a non-trivial amount of manual tasks such as GPU allocation, training status tracking, and…

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-27 · Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-15 · Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou

To fully exploit the physics potential of current and future high energy particle colliders, machine learning (ML) can be implemented in detector electronics for intelligent data processing and acquisition. The implementation of ML in…

Instrumentation and Detectors · Physics · 2024-11-19 · Haoyi Jia, Abhilasha Dave, Julia Gonski, Ryan Herbst

Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging…

We explored the possible benefits of integrating quantum simulators in a "hybrid" quantum machine learning (QML) workflow that uses both classical and quantum computations in a high-performance computing (HPC) environment. Here, we used two…

Emerging Technologies · Computer Science · 2024-07-11 · Samuel T. Bieberich, Michael A. Sandoval