Related papers: Job Scheduling in High Performance Computing

Scheduler Technologies in Support of High Performance Data Analysis

Job schedulers are a key component of scalable computing infrastructures. They orchestrate all of the work executed on the computing infrastructure and directly impact the effectiveness of the system. Recently, job workloads have…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-06 Albert Reuther, Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Michael Jones, Peter Michaleas, Andrew Prout, Antonio Rosa, Jeremy Kepner

Scheduling Beyond CPUs for HPC

High performance computing (HPC) is undergoing significant changes. The emerging HPC applications comprise both compute- and data-intensive applications. To meet the intense I/O demand from emerging data-intensive applications, burst…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-11 Yuping Fan, Zhiling Lan, Paul Rich, William E. Allcock, Michael E. Papka, Brian Austin, David Paul

Scalable System Scheduling for HPC and Big Data

In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-06 Albert Reuther, Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Michael Jones, Peter Michaleas, Andrew Prout, Antonio Rosa, Jeremy Kepner

A HPC Co-Scheduler with Reinforcement Learning

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-19 Abel Souza, Kristiaan Pelckmans, Johan Tordsson

Conceptual and Technical Challenges for High Performance Computing

High Performance Computing (HPC) aims at providing reasonably fast computing solutions to scientific and real life problems. The advent of multicore architectures is noticeable in the HPC history, because it has brought the underlying…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 Claude Tadonki

Three Practical Workflow Schedulers for Easy Maximum Parallelism

Runtime scheduling and workflow systems are an increasingly popular algorithmic component in HPC because they allow full system utilization with relaxed synchronization requirements. There are so many special-purpose tools for task…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-03 David M. Rogers

Periodic I/O scheduling for super-computers

With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in super-computers. Architectural enhancement such as burst-buffers and pre-fetching are added to machines, but are not sufficient to…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-23 Guillaume Aupy, Ana Gainaru, Valentin Le Fèvre

Helping HPC Users Specify Job Memory Requirements via Machine Learning

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulties to estimate the number of processors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-10 Eduardo R. Rodrigues, Renato L. F. Cunha, Marco A. S. Netto, Michael Spriggs

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Today high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-03 Di Zhang, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie

Effective Handling of Urgent Jobs - Speed Up Scheduling for Computing Applications

A queue is required when a service provider is not able to handle jobs arriving over the time. In a highly flexible and dynamic environment, some jobs might demand for faster execution at run-time especially when the resources are limited…

Performance · Computer Science 2015-03-24 Yash Gupta, Kamalakar Karlapalem

Scheduling Jobs with Random Resource Requirements in Computing Clusters

We consider a natural scheduling problem which arises in many distributed computing frameworks. Jobs with diverse resource requirements (e.g. memory requirements) arrive over time and must be served by a cluster of servers, each with a…

Networking and Internet Architecture · Computer Science 2019-01-21 Konstantinos Psychas, Javad Ghaderi

ROME: A Multi-Resource Job Scheduling Framework for Exascale HPC Systems

High-performance computing (HPC) is undergoing significant changes. Next generation HPC systems are equipped with diverse global and local resources, such as I/O burst buffer resources, memory resources (e.g., on-chip and off-chip RAM,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-31 Yuping Fan

Hybrid Workload Scheduling on HPC Systems

Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. The ever-growing workload demands and rapidly developing HPC infrastructure trigger the interest of converging these…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-14 Yuping Fan, Paul Rich, William Allcock, Michael Papka, Zhiling Lan

Deep Reinforcement Learning for Multi-Resource Multi-Machine Job Scheduling

Minimizing job scheduling time is a fundamental issue in data center networks that has been extensively studied in recent years. The incoming jobs require different CPU and memory units, and span different number of time slots. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-21 Weijia Chen, Yuedong Xu, Xiaofeng Wu

A Review of Tools and Techniques for Optimization of Workload Mapping and Scheduling in Heterogeneous HPC System

This paper presents a systematic review of mapping and scheduling strategies within the High-Performance Computing (HPC) compute continuum, with a particular emphasis on heterogeneous systems. It introduces a prototype workflow to establish…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-19 Aasish Kumar Sharma, Julian Kunkel

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

We present a scheduler that improves cluster utilization and job completion times by packing tasks having multi-resource requirements and inter-dependencies. While the problem is algorithmically very hard, we achieve near-optimality on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-04-26 Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni

Exploring the Relation Between Two Levels of Scheduling Using a Novel Simulation Approach

Modern high performance computing (HPC) systems exhibit a rapid growth in size, both "horizontally" in the number of nodes, as well as "vertically" in the number of cores per node. As such, they offer additional levels of hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-06 Ahmed Eleliemy, Ali Mohammed, Florina M. Ciorba

A Comparative Study of CPU Scheduling Algorithms

Developing CPU scheduling algorithms and understanding their impact in practice can be difficult and time consuming due to the need to modify and test operating system kernel code and measure the resulting performance on a consistent…

Operating Systems · Computer Science 2013-07-17 Neetu Goel, R. B. Garg

Online Job Failure Prediction in an HPC System

Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily raising, representing a critical issue…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Francesco Antici, Andrea Borghesi, Zeynep Kiziltan

A Survey on Dynamic Job Scheduling in Grid Environment Based on Heuristic Algorithms

Computational Grids are a new trend in distributed computing systems. They allow the sharing of geographically distributed resources in an efficient way, extending the boundaries of what we perceive as distributed computing. Various…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-24 D. Thilagavathi, Antony Selvadoss Thanamani