Related papers: Union: An Automatic Workload Manager for Accelerat…

200 papers

The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is…

Machine Learning · Computer Science 2025-11-17 Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan

Running scientific workflows on a supercomputer can be a daunting task for a scientific domain specialist. Workflow management solutions (WMS) are a standard method for reducing the complexity of application deployment on high performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-30 Wouter Klijn, Sandra Diaz-Pier, Abigail Morrison, Alexander Peyser

Interactive urgent computing is a small but growing user of supercomputing resources. However there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads which…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-29 Nick Brown, Rupert Nash, Gordon Gibb, Evgenij Belikov, Artur Podobas, Wei Der Chien, Stefano Markidis, Markus Flatken, Andreas Gerndt

Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased,…

Networking and Internet Architecture · Computer Science 2024-04-05 Yao Kang, Xin Wang, Zhiling Lan

Workload characterization is an integral part of performance analysis of high performance computing (HPC) systems. An understanding of workload properties sheds light on resource utilization and can be used to inform performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Nikolay A. Simakov, Joseph P. White, Robert L. DeLeon, Steven M. Gallo, Matthew D. Jones, Jeffrey T. Palmer, Benjamin Plessinger, Thomas R. Furlani

To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs…

Dragonfly class of networks are considered as promising interconnects for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance…

Networking and Internet Architecture · Computer Science 2024-06-24 Yao Kang, Xin Wang, Neil McGlohon, Misbah Mubarak, Sudheer Chunduri, Zhiling Lan

Scientific applications often contain large, computationally-intensive, and irregular parallel loops or tasks that exhibit stochastic characteristics. Applications may suffer from load imbalance during their execution on high-performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-16 Ali Mohammed, Ahmed Eleliemy, Florina M. Ciorba, Franziska Kasielke, Ioana Banicescu

Coupled AI-Simulation workflows are becoming the major workloads for HPC facilities, and their increasing complexity necessitates new tools for performance analysis and prototyping of new in-situ workflows. We present SimAI-Bench, a tool…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-24 Harikrishna Tummalapalli, Riccardo Balin, Christine M. Simpson, Andrew Park, Aymen Alsaadi, Andrew E. Shao, Wesley Brewer, Shantenu Jha

With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows,…

The convergence of IoT, Edge, Cloud, and HPC technologies creates a compute continuum that merges cloud scalability and flexibility with HPC's computational power and specialized optimizations. However, integrating cloud and HPC resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Aasish Kumar Sharma, Christian Boehme, Patrick Gelß, Ramin Yahyapour, Julian Kunkel

We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim's scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-19 Cristian Galleguillos, Zeynep Kiziltan, Alessio Netti, Ricardo Soto

With the advancement of modern robotics, autonomous agents are now capable of hosting sophisticated algorithms, which enables them to make intelligent decisions. But developing and testing such algorithms directly in real-world systems is…

Robotics · Computer Science 2022-08-16 Emon Dey, Jumman Hossain, Nirmalya Roy, Carl Busart

Molecular dynamics (MD) simulations are widely used to study large-scale molecular systems. HPC systems are ideal platforms to run these studies, however, reaching the necessary simulation timescale to detect rare processes is challenging,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-22 Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Frédéric Suter, Silvina Caíno-Lores, Michela Taufer, Ewa Deelman

Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produces vast quantities of analytic data from multiple…

Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally,…

Hardware Architecture · Computer Science 2026-01-30 Rebecca Pelke, Nils Bosbach, Lennart M. Reimann, Rainer Leupers

We explore the idea of integrating machine learning (ML) with high performance computing (HPC)-driven simulations to address challenges in using simulations to teach computational science and engineering courses. We demonstrate that a ML…

Physics Education · Physics 2020-09-01 Vikram Jadhao, JCS Kadupitiya

This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative,…

Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-23 Zhong Zheng, Michael E. Papka, Zhiling Lan

Computing demands for large scientific experiments, such as the CMS experiment at the CERN LHC, will increase dramatically in the next decades. To complement the future performance increases of software running on central processing units…

Instrumentation and Detectors · Physics 2024-09-09 CMS Collaboration