Related papers: Characterizing Production GPU Workloads using Syst…

GPU Under Pressure: Estimating Application's Stress via Telemetry and Performance Counters

Graphics Processing Units (GPUs) are specialized accelerators in data centers and high-performance computing (HPC) systems, enabling the fast execution of compute-intensive applications, such as Convolutional Neural Networks (CNNs).…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-10 · Giuseppe Esposito, Juan-David Guerrero-Balaguera, Josie Esteban Rodriguez Condia, Matteo Sonza Reorda, Marco Barbiero, Rossella Fortuna

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

Resource demands of HPC applications vary significantly. However, it is common for HPC systems to primarily assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-03-14 · Jie Li, George Michelogiannakis, Brandon Cook, Dulanya Cooray, Yong Chen

Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs

As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-22 · Melanie Cornelius, Greg Cross, Shilpika Shilpika, Matthew T. Dearing, Zhiling Lan

Power Consumption Analysis of Parallel Algorithms on GPUs

Due to their highly parallel multi-cores architecture, GPUs are being increasingly used in a wide range of computationally intensive applications. Compared to CPUs, GPUs can achieve higher performances at accelerating the programs'…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-10-05 · Frédéric Magoulès, Abal-Kassim Cheik Ahamed, Alban Desmaison, Jean-Christophe Léchenet, François Mayer, Haifa Ben Salem, Thomas Zhu

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-08 · Eishi Arima, Minjoon Kang, Issa Saba, Josef Weidendorfer, Carsten Trinitis, Martin Schulz

A Case Study on Job Scheduling Policy for Workload Characterization and Power Efficiency

With the increasing popularity of cloud computing, datacenters are becoming more important than ever before. A typical datacenter typically consists of a large number of homogeneous or heterogeneous servers connected by networks.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2014-05-15 · Aftab Ahmed Chandio, Zhibin Yu, Feroz Shah Syed, Imtiaz Ali Korejo

Contention-Aware GPU Partitioning and Task-to-Partition Allocation for Real-Time Workloads

In order to satisfy timing constraints, modern real-time applications require massively parallel accelerators such as General Purpose Graphic Processing Units (GPGPUs). Generation after generation, the number of computing clusters made…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-05-24 · Houssam-Eddine Zahaf, Ignacio Sanudo Olmedo, Jayati Singh, Nicola Capodieci, Sebastien Faucou

Prediction of Performance and Power Consumption of GPGPU Applications

Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve an Exascale performance. The main goal of application developers of GPU is to tune their code extensively to obtain optimal performance,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-04 · Gargi Alavani, Santonu Sarkar

Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems

Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-10 · Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, Shivaram Venkataraman

A Comprehensive Analysis of Process Energy Consumption on Multi-Socket Systems with GPUs

Robustly estimating energy consumption in High-Performance Computing (HPC) is essential for assessing the energy footprint of modern workloads, particularly in fields such as Artificial Intelligence (AI) research, development, and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-10 · Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-29 · Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing

Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-23 · Zhong Zheng, Michael E. Papka, Zhiling Lan

Understanding the Landscape of Ampere GPU Memory Errors

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-05 · Zhu Zhu, Yu Sun, Dhatri Parakal, Bo Fang, Steven Farrell, Gregory H. Bauer, Brett Bode, Ian T. Foster, Michael E. Papka, William Gropp, Zhao Zhang, Lishan Yang

In-Situ Assessment of Device-Side Compute Work for Dynamic Load Balancing in a GPU-Accelerated PIC Code

Maintaining computational load balance is important to the performant behavior of codes which operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-05 · Michael E. Rowan, Axel Huebl, Kevin N. Gott, Jack Deslippe, Maxence Thévenet, Remi Lehe, Jean-Luc Vay

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-27 · Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

LLload: An Easy-to-Use HPC Utilization Tool

The increasing use and cost of high performance computing (HPC) requires new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing…

Performance · Computer Science · 2025-04-08 · Chansup Byun, Albert Reuther, Julie Mullen, LaToya Anderson, William Arcand, Bill Bergeron, David Bestor, Alexander Bonn, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Peter Michaleas, Lauren Milechin, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner

Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure

Diagnosing GPU tail latency spikes in cloud and HPC infrastructure is critical for maintaining performance predictability and resource utilization, yet existing monitoring tools lack the granularity for root cause analysis in shared…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-21 · Erfan Darzi, Aldo Pareja, Shreeanant Bharadwaj

Unsupervised KPIs-Based Clustering of Jobs in HPC Data Centers

Performance analysis is an essential task in High-Performance Computing (HPC) systems and it is applied for different purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring tasks generate a huge…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-12-12 · Mohamed S. Halawa, Rebeca P. Díaz-Redondo, Ana Fernández-Vilas

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi/many core architectures, a popular choice for an…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-04-10 · Ali TehraniJamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, Ali Jannesari

Timing and Memory Telemetry on GPUs for AI Governance

The rapid expansion of GPU-accelerated computing has enabled major advances in large-scale artificial intelligence (AI), while heightening concerns about how accelerators are observed or governed once deployed. Governance is essential to…

Cryptography and Security · Computer Science · 2026-02-13 · Saleh K. Monfared, Fatemeh Ganji, Dan Holcomb, Shahin Tajik