Related papers: Bioinformatics Computational Cluster Batch Task Pr…

Machine Learning for Predictive Analytics of Compute Cluster Jobs

We address the problem of predicting whether sufficient memory and CPU resources have been requested for jobs at submission time. For this purpose, we examine the task of training a supervised machine learning system to predict the outcome…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-05 Dan Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon

Helping HPC Users Specify Job Memory Requirements via Machine Learning

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulties to estimate the number of processors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-10 Eduardo R. Rodrigues, Renato L. F. Cunha, Marco A. S. Netto, Michael Spriggs

A HPC Co-Scheduler with Reinforcement Learning

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-19 Abel Souza, Kristiaan Pelckmans, Johan Tordsson

Scheduling Jobs with Random Resource Requirements in Computing Clusters

We consider a natural scheduling problem which arises in many distributed computing frameworks. Jobs with diverse resource requirements (e.g. memory requirements) arrive over time and must be served by a cluster of servers, each with a…

Networking and Internet Architecture · Computer Science 2019-01-21 Konstantinos Psychas, Javad Ghaderi

Predicting the Performance of Scientific Workflow Tasks for Cluster Resource Management: An Overview of the State of the Art

Scientific workflow management systems support large-scale data analysis on cluster infrastructures. For this, they interact with resource managers which schedule workflow tasks onto cluster nodes. In addition to workflow task descriptions,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-30 Jonathan Bader, Kathleen West, Soeren Becker, Svetlana Kulagina, Fabian Lehmann, Lauritz Thamsen, Henning Meyerhenke, Odej Kao

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

Workload Failure Prediction for Data Centers

Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

Enhancing Cluster Scheduling in HPC: A Continuous Transfer Learning for Real-Time Optimization

This study presents a machine learning-assisted approach to optimize task scheduling in cluster systems, focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability, whereas the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Leszek Sliwko, Jolanta Mizera-Pietraszko

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

We present a scheduler that improves cluster utilization and job completion times by packing tasks having multi-resource requirements and inter-dependencies. While the problem is algorithmically very hard, we achieve near-optimality on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-04-26 Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

Revisiting Reliability in Large-Scale Machine Learning Research Clusters

Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-10 Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, Carole-Jean Wu

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao, Yanghua Peng, Chuan Wu, Zongpeng Li

Use of Data Mining in Scheduler Optimization

The operating system's role in a computer system is to manage the various resources. One of these resources is the Central Processing Unit. It is managed by a component of the operating system called the CPU scheduler. Schedulers are…

Operating Systems · Computer Science 2010-11-09 George Anderson, Tshilidzi Marwala, Fulufhelo V. Nelwamondo

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-03-04 Blesson Varghese, Gerard McKee, Vassil Alexandrov

Online Distributed Scheduling on a Fault-prone Parallel System

We consider a parallel system of $m$ identical machines prone to unpredictable crashes and restarts, trying to cope with the continuous arrival of tasks to be executed. Tasks have different computational requirements (i.e., processing time…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-21 Elli Zavou, Antonio Fernández Anta

Container Profiler: Profiling Resource Utilization of Containerized Big Data Pipelines

This paper presents the Container Profiler, a software tool that measures and records the resource usage of any containerized task. Our tool profiles the CPU, memory, disk, and network utilization of containerized tasks collecting over…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-08 Varik Hoang, Ling-Hong Hung, David Perez, Huazeng Deng, Raymond Schooley, Niharika Arumilli, Ka Yee Yeung, Wes Lloyd

Per-Instance Algorithm Selection for Recommender Systems via Instance Clustering

Recommendation algorithms perform differently if the users, recommendation contexts, applications, and user interfaces vary even slightly. It is similarly observed in other fields, such as combinatorial problem solving, that algorithms…

Information Retrieval · Computer Science 2021-01-01 Andrew Collins, Laura Tierney, Joeran Beel

Task Runtime Prediction in Scientific Workflows Using an Online Incremental Learning Approach

Many algorithms in workflow scheduling and resource provisioning rely on the performance estimation of tasks to produce a scheduling plan. A profiler that is capable of modeling the execution of tasks and predicting their runtime…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-01 Muhammad H. Hilman, Maria A. Rodriguez, Rajkumar Buyya

Machine Learning Based Prediction and Classification of Computational Jobs in Cloud Computing Centers

With the rapid growth of the data volume and the fast increasing of the computational model complexity in the scenario of cloud computing, it becomes an important topic that how to handle users' requests by scheduling computational jobs and…

Machine Learning · Computer Science 2021-05-10 Zheqi Zhu, Pingyi Fan

Optimal Virtual Cluster-based Multiprocessor Scheduling

Scheduling of constrained deadline sporadic task systems on multiprocessor platforms is an area which has received much attention in the recent past. It is widely believed that finding an optimal scheduler is hard, and therefore most…

Operating Systems · Computer Science 2020-04-07 Arvind Easwaran, Insik Shin, Insup Lee