Related papers: Sequence-to-sequence models for workload interfere…

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

A HPC Co-Scheduler with Reinforcement Learning

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-19 Abel Souza, Kristiaan Pelckmans, Johan Tordsson

Co-Scheduling Algorithms for High-Throughput Workload Execution

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several…

Data Structures and Algorithms · Computer Science 2013-05-01 Guillaume Aupy, Manu Shantharam, Anne Benoit, Yves Robert, Padma Raghavan

Scheduling Jobs with Random Resource Requirements in Computing Clusters

We consider a natural scheduling problem which arises in many distributed computing frameworks. Jobs with diverse resource requirements (e.g. memory requirements) arrive over time and must be served by a cluster of servers, each with a…

Networking and Internet Architecture · Computer Science 2019-01-21 Konstantinos Psychas, Javad Ghaderi

End-to-End Predictions-Based Resource Management Framework for Supercomputer Jobs

Job submissions of parallel applications to production supercomputer systems will have to be carefully tuned in terms of the job submission parameters to obtain minimum response times. In this work, we have developed an end-to-end resource…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-20 Swetha Hariharan, Prakash Murali, Abhishek Pasari, Sathish Vadhiyar

Job Shop Scheduling via Deep Reinforcement Learning: a Sequence to Sequence approach

Job scheduling is a well-known Combinatorial Optimization problem with endless applications. Well planned schedules bring many benefits in the context of automated systems: among others, they limit production costs and waste. Nevertheless,…

Artificial Intelligence · Computer Science 2023-08-04 Giovanni Bonetta, Davide Zago, Rossella Cancelliere, Andrea Grosso

Towards Accommodating Real-time Jobs on HPC Platforms

Increasing data volumes in scientific experiments necessitate the use of high-performance computing (HPC) resources for data analysis. In many scientific fields, the data generated from scientific instruments and supercomputer simulations…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-25 Sam Nickolay, Eun-Sung Jung, Rajkumar Kettimuthu, Ian Foster

Efficient Resource Matching in Heterogeneous Grid Using Resource Vector

In this paper, a method for efficient scheduling to obtain optimum job throughput in a distributed campus grid environment is presented. Traditional job schedulers determine job scheduling using user and job resource attributes. User…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-07-15 Srirangam V Addepallil, Per Andersen, George L Barnes

Co-scheduling Ensembles of In Situ Workflows

Molecular dynamics (MD) simulations are widely used to study large-scale molecular systems. HPC systems are ideal platforms to run these studies, however, reaching the necessary simulation timescale to detect rare processes is challenging,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-22 Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Frédéric Suter, Silvina Caíno-Lores, Michela Taufer, Ewa Deelman

Deep Reinforcement Learning for Multi-Resource Multi-Machine Job Scheduling

Minimizing job scheduling time is a fundamental issue in data center networks that has been extensively studied in recent years. The incoming jobs require different CPU and memory units, and span different number of time slots. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-21 Weijia Chen, Yuedong Xu, Xiaofeng Wu

Flow-time Optimization For Concurrent Open-Shop and Precedence Constrained Scheduling Models

Scheduling a set of jobs over a collection of machines is a fundamental problem that needs to be solved millions of times a day in various computing platforms: in operating systems, in large data clusters, and in data centers. Along with…

Data Structures and Algorithms · Computer Science 2018-07-10 Janardhan Kulkarni, Shi Li

Workload Failure Prediction for Data Centers

Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

Scheduling to Optimize Sojourn Time of Successful Jobs

Deep neural networks training jobs and other iterative computations frequently include checkpoints where jobs can be canceled based on the current value of monitored metrics. While most of existing results focus on the performance of all…

Performance · Computer Science 2022-09-30 Yuan Yao, Marco Paolieri, Leana Golubchik

A new job migration algorithm to improve data center efficiency

The underexploitation of available resources risks becoming one of the main problems for a computing center. The growing demand for computational power necessarily entails more complex approaches in the management of the computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-05-01 Federico Calzolari, Silvia Volpe

Effective Handling of Urgent Jobs - Speed Up Scheduling for Computing Applications

A queue is required when a service provider is not able to handle jobs arriving over time. In a highly flexible and dynamic environment, some jobs might demand faster execution at run-time, especially when the resources are limited…

Performance · Computer Science 2015-03-24 Yash Gupta, Kamalakar Karlapalem

Analysis of Workflow Schedulers in Simulated Distributed Environments

Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed on both HPC clusters and in the cloud. An important aspect of executing such graphs is the used scheduling algorithm.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-18 Jakub Beránek, Stanislav Böhm, Vojtěch Cima

Smart Scheduling of Continuous Data-Intensive Workflows with Machine Learning Triggered Execution

To extract value from ever-growing volumes of data, coming from a number of different sources, and to drive decision making, organizations frequently resort to the composition of data processing workflows, since they are expressive,…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-13 Sérgio Esteves, Helena Galhardas, Luís Veiga

Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints

Motivated by modern parallel computing applications, we consider the problem of scheduling parallel-task jobs with heterogeneous resource requirements in a cluster of machines. Each job consists of a set of tasks that can be processed in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-03 Mehrnoosh Shafiee, Javad Ghaderi

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-28 Jonathan Will, Jonathan Bader, Lauritz Thamsen