Related papers: Union: An Automatic Workload Manager for Accelerat…

SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is…

Machine Learning · Computer Science 2025-11-17 Xin Wang , Pietro Lodi Rizzini , Sourav Medya , Zhiling Lan

Staged deployment of interactive multi-application HPC workflows

Running scientific workflows on a supercomputer can be a daunting task for a scientific domain specialist. Workflow management solutions (WMS) are a standard method for reducing the complexity of application deployment on high performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-30 Wouter Klijn , Sandra Diaz-Pier , Abigail Morrison , Alexander Peyser

Workflows to driving high-performance interactive supercomputing for urgent decision making

Interactive urgent computing is a small but growing user of supercomputing resources. However there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads which…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-29 Nick Brown , Rupert Nash , Gordon Gibb , Evgenij Belikov , Artur Podobas , Wei Der Chien , Stefano Markidis , Markus Flatken , Andreas Gerndt

Study of Workload Interference with Intelligent Routing on Dragonfly

Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased,…

Networking and Internet Architecture · Computer Science 2024-04-05 Yao Kang , Xin Wang , Zhiling Lan

A Workload Analysis of NSF's Innovative HPC Resources Using XDMoD

Workload characterization is an integral part of performance analysis of high performance computing (HPC) systems. An understanding of workload properties sheds light on resource utilization and can be used to inform performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Nikolay A. Simakov , Joseph P. White , Robert L. DeLeon , Steven M. Gallo , Matthew D. Jones , Jeffrey T. Palmer , Benjamin Plessinger , Thomas R. Furlani

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs…

Hardware Architecture · Computer Science 2021-11-09 Geonhwa Jeong , Gokcen Kestor , Prasanth Chatarasi , Angshuman Parashar , Po-An Tsai , Sivasankaran Rajamanickam , Roberto Gioiosa , Tushar Krishna

Modeling and Analysis of Application Interference on Dragonfly+

Dragonfly class of networks are considered as promising interconnects for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance…

Networking and Internet Architecture · Computer Science 2024-06-24 Yao Kang , Xin Wang , Neil McGlohon , Misbah Mubarak , Sudheer Chunduri , Zhiling Lan

An Approach for Realistically Simulating the Performance of Scientific Applications on High Performance Computing Systems

Scientific applications often contain large, computationally-intensive, and irregular parallel loops or tasks that exhibit stochastic characteristics. Applications may suffer from load imbalance during their execution on high-performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-16 Ali Mohammed , Ahmed Eleliemy , Florina M. Ciorba , Franziska Kasielke , Ioana Banicescu

In-Transit Data Transport Strategies for Coupled AI-Simulation Workflow Patterns

Coupled AI-Simulation workflows are becoming the major workloads for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and prototyping of new in-situ workflows. We present SimAI-Bench, a tool…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-24 Harikrishna Tummalapalli , Riccardo Balin , Christine M. Simpson , Andrew Park , Aymen Alsaadi , Andrew E. Shao , Wesley Brewer , Shantenu Jha

Enabling Machine Learning-Ready HPC Ensembles with Merlin

With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 J. Luc Peterson , Ben Bay , Joe Koning , Peter Robinson , Jessica Semler , Jeremy White , Rushil Anirudh , Kevin Athey , Peer-Timo Bremer , Francesco Di Natale , David Fox , Jim A. Gaffney , Sam A. Jacobs , Bhavya Kailkhura , Bogdan Kustowski , Steven Langer , Brian Spears , Jayaraman Thiagarajan , Brian Van Essen , Jae-Seung Yeom

Workflow-Driven Modeling for the Compute Continuum: An Optimization Approach to Automated System and Workload Scheduling

The convergence of IoT, Edge, Cloud, and HPC technologies creates a compute continuum that merges cloud scalability and flexibility with HPC's computational power and specialized optimizations. However, integrating cloud and HPC resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Aasish Kumar Sharma , Christian Boehme , Patrick Gelß , Ramin Yahyapour , Julian Kunkel

AccaSim: a Customizable Workload Management Simulator for Job Dispatching Research in HPC Systems

We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim's scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-19 Cristian Galleguillos , Zeynep Kiziltan , Alessio Netti , Ricardo Soto

SynchroSim: An Integrated Co-simulation Middleware for Heterogeneous Multi-robot System

With the advancement of modern robotics, autonomous agents are now capable of hosting sophisticated algorithms, which enables them to make intelligent decisions. But developing and testing such algorithms directly in real-world systems is…

Robotics · Computer Science 2022-08-16 Emon Dey , Jumman Hossain , Nirmalya Roy , Carl Busart

Co-scheduling Ensembles of In Situ Workflows

Molecular dynamics (MD) simulations are widely used to study large-scale molecular systems. HPC systems are ideal platforms to run these studies, however, reaching the necessary simulation timescale to detect rare processes is challenging,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-22 Tu Mai Anh Do , Loïc Pottier , Rafael Ferreira da Silva , Frédéric Suter , Silvina Caíno-Lores , Michela Taufer , Ewa Deelman

Supercomputer 3D Digital Twin for User Focused Real-Time Monitoring

Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produce vast quantities of analytic data from multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 William Bergeron , Matthew Hubbell , Daniel Mojica , Albert Reuther , William Arcand , David Bestor , Daniel Burrill , Chansup Byun , Vijay Gadepally , Michael Houle , Hayden Jananthan , Michael Jones , Piotr Luszczek , Peter Michaleas , Lauren Milechin , Julie Mullen , Andrew Prout , Antonio Rosa , Charles Yee , Jeremy Kepner

Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads

Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally,…

Hardware Architecture · Computer Science 2026-01-30 Rebecca Pelke , Nils Bosbach , Lennart M. Reimann , Rainer Leupers

Integrating Machine Learning with HPC-driven Simulations for Enhanced Student Learning

We explore the idea of integrating machine learning (ML) with high performance computing (HPC)-driven simulations to address challenges in using simulations to teach computational science and engineering courses. We demonstrate that a ML…

Physics Education · Physics 2020-09-01 Vikram Jadhao , JCS Kadupitiya

Transforming the Hybrid Cloud for Emerging AI Workloads

This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-23 Deming Chen , Alaa Youssef , Ruchi Pendse , André Schleife , Bryan K. Clark , Hendrik Hamann , Jingrui He , Teodoro Laino , Lav Varshney , Yuxiong Wang , Avirup Sil , Reyhaneh Jabbarvand , Tianyin Xu , Volodymyr Kindratenko , Carlos Costa , Sarita Adve , Charith Mendis , Minjia Zhang , Santiago Núñez-Corrales , Raghu Ganti , Mudhakar Srivatsa , Nam Sung Kim , Josep Torrellas , Jian Huang , Seetharami Seelam , Klara Nahrstedt , Tarek Abdelzaher , Tamar Eilam , Huimin Zhao , Matteo Manica , Ravishankar Iyer , Martin Hirzel , Vikram Adve , Darko Marinov , Hubertus Franke , Hanghang Tong , Elizabeth Ainsworth , Han Zhao , Deepak Vasisht , Minh Do , Sahil Suneja , Fabio Oliveira , Giovanni Pacifici , Ruchir Puri , Priya Nagpurkar

More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing

Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-23 Zhong Zheng , Michael E. Papka , Zhiling Lan

Portable acceleration of CMS computing workflows with coprocessors as a service

Computing demands for large scientific experiments, such as the CMS experiment at the CERN LHC, will increase dramatically in the next decades. To complement the future performance increases of software running on central processing units…

Instrumentation and Detectors · Physics 2024-09-09 CMS Collaboration