Related papers: Enhancing Cluster Scheduling in HPC: A Continuous …

Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters

Containerization technology offers lightweight OS-level virtualization, and enables portability, reproducibility, and flexibility by packing applications with low performance overhead and low effort to maintain and scale them. Moreover,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-22 Peini Liu, Jordi Guitart

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

A HPC Co-Scheduler with Reinforcement Learning

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-19 Abel Souza, Kristiaan Pelckmans, Johan Tordsson

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

Cluster Workload Allocation: A Predictive Approach Leveraging Machine Learning Efficiency

This research investigates how Machine Learning (ML) algorithms can assist in workload allocation strategies by detecting tasks with node affinity operators (referred to as constraint operators), which constrain their execution to a limited…

Machine Learning · Computer Science 2025-09-25 Leszek Sliwko

Energy-aware Scheduling of Jobs in Heterogeneous Cluster Systems Using Deep Reinforcement Learning

Energy consumption is one of the most critical concerns in designing computing devices, ranging from portable embedded systems to computer cluster systems. Furthermore, in the past decade, cluster systems have increasingly risen as popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-12 Amirhossein Esmaili, Massoud Pedram

Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing Optimization

With the continuous expansion of the scale of cloud computing applications, artificial intelligence technologies such as Deep Learning and Reinforcement Learning have gradually become the key tools to solve the automated task scheduling of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-14 Zheng Xu, Yulu Gong, Yanlin Zhou, Qiaozhi Bao, Wenpin Qian

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Menglu Yu, Jia Liu, Chuan Wu, Bo Ji, Elizabeth S. Bentley

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters

Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent…

Machine Learning · Computer Science 2025-03-25 Sahil Tyagi, Prateek Sharma

Transferable Deep Clustering Model

Deep learning has shown remarkable success in the field of clustering recently. However, how to transfer a trained clustering model on a source domain to a target domain by leveraging the acquired knowledge to guide the clustering process…

Machine Learning · Computer Science 2023-10-10 Zheng Zhang, Liang Zhao

Mitigating context switching in densely packed Linux clusters with Latency-Aware Group Scheduling

Cluster orchestrators such as Kubernetes depend on accurate estimates of node capacity and job requirements. Inaccuracies in either lead to poor placement decisions and degraded cluster performance. In this paper, we show that in densely…

Operating Systems · Computer Science 2025-08-22 Al Amjad Tawfiq Isstaif, Evangelia Kalyvianaki, Richard Mortier

Cloud-Based Scheduling Mechanism for Scalable and Resource-Efficient Centralized Controllers

This paper proposes a novel approach to address the challenges of deploying complex robotic software in large-scale systems, i.e., Centralized Nonlinear Model Predictive Controllers (CNMPCs) for multi-agent systems. The proposed approach is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-13 Achilleas Santi Seisa, Sumeet Gajanan Satpute, George Nikolakopoulos

Periodic I/O scheduling for super-computers

With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in super-computers. Architectural enhancement such as burst-buffers and pre-fetching are added to machines, but are not sufficient to…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-23 Guillaume Aupy, Ana Gainaru, Valentin Le Fèvre

Network Contention-Aware Cluster Scheduling with Reinforcement Learning

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. Specifically, for emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can…

Machine Learning · Computer Science 2023-11-01 Junyeol Ryu, Jeongyoon Eo

Kub: Enabling Elastic HPC Workloads on Containerized Environments

The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-15 Daniel Medeiros, Jacob Wahlgren, Gabin Schieffer, Ivy Peng

Towards Continually Learning Application Performance Models

Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are…

Machine Learning · Computer Science 2023-10-27 Ray A. O. Sinurat, Anurag Daram, Haryadi S. Gunawi, Robert B. Ross, Sandeep Madireddy

Survey of adaptive containerization architectures for HPC

Containers offer an array of advantages that benefit research reproducibility and portability across groups and systems. As container tools mature, container security improves, and High-performance computing (HPC) and cloud system tools…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-24 Tiziano Müller, Nina Mujkanovic, Juan J. Durillo, Nicolay Hammer

Dynamic Scheduling Strategies for Resource Optimization in Computing Environments

The rapid development of cloud-native architecture has promoted the widespread application of container technology, but the optimization problems in container scheduling and resource management still face many challenges. This paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-24 Xiaoye Wang

Distributedness based scheduling

Efficient utilization of computing resources in a Kubernetes cluster is often constrained by the uneven distribution of pods with similar usage patterns. This paper presents a novel scheduling strategy designed to optimize the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Paritosh Ranjan, Surajit Majumder, Prodip Roy, Bhuban Padhan

Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-20 Patrick Zojer, Jonas Posner, Taylan Özden