Related papers: Learning to Schedule: A Supervised Learning Framew…

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Menglu Yu, Jia Liu, Chuan Wu, Bo Ji, Elizabeth S. Bentley

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao, Yanghua Peng, Chuan Wu, Zongpeng Li

Adaptive Scheduling for Machine Learning Tasks over Networks

A key functionality of emerging connected autonomous systems such as smart transportation systems, smart cities, and the industrial Internet-of-Things, is the ability to process and learn from data collected at different physical locations.…

Machine Learning · Computer Science 2021-01-26 Konstantinos Gatsis

Scheduling Jobs with Random Resource Requirements in Computing Clusters

We consider a natural scheduling problem which arises in many distributed computing frameworks. Jobs with diverse resource requirements (e.g. memory requirements) arrive over time and must be served by a cluster of servers, each with a…

Networking and Internet Architecture · Computer Science 2019-01-21 Konstantinos Psychas, Javad Ghaderi

Sequence-to-sequence models for workload interference

Co-scheduling of jobs in data centers is a challenging scenario, where jobs can compete for resources, leading to severe slowdowns or failed executions. Efficient job placement in environments where resources are shared requires awareness…

Machine Learning · Computer Science 2020-07-07 David Buchaca Prats, Joan Marcual, Josep Lluís Berral, David Carrera

Enhancing Cluster Scheduling in HPC: A Continuous Transfer Learning for Real-Time Optimization

This study presents a machine learning-assisted approach to optimize task scheduling in cluster systems, focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability, whereas the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Leszek Sliwko, Jolanta Mizera-Pietraszko

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL…

Machine Learning · Computer Science 2019-09-16 Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-28 Xiaoyang Zhao, Chuan Wu

Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling

We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay. Modern communication systems are becoming increasingly complex, and are required to handle multiple types of traffic with widely…

Machine Learning · Computer Science 2021-05-04 Mohammani Zaki, Avi Mohan, Aditya Gopalan, Shie Mannor

Energy-aware Scheduling of Jobs in Heterogeneous Cluster Systems Using Deep Reinforcement Learning

Energy consumption is one of the most critical concerns in designing computing devices, ranging from portable embedded systems to computer cluster systems. Furthermore, in the past decade, cluster systems have increasingly risen as popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-12 Amirhossein Esmaili, Massoud Pedram

Metronome: Efficient Scheduling for Periodic Traffic Jobs with Network and Priority Awareness

With the rapid growth in computing power demand, cloud native networks have emerged as a promising solution to address the challenges of efficient resource coordination, particularly in coping with the dynamic fluctuations of network…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-11 Hao Jiang, Meng Qin, Ruijie Kuai, Dandan Liang, Yue Gao

Deep Reinforcement Learning for Multi-Resource Multi-Machine Job Scheduling

Minimizing job scheduling time is a fundamental issue in data center networks that has been extensively studied in recent years. The incoming jobs require different CPU and memory units, and span different numbers of time slots. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-21 Weijia Chen, Yuedong Xu, Xiaofeng Wu

Scheduling in Data Intensive and Network Aware (DIANA) Grid Environments

In Grids scheduling decisions are often made on the basis of jobs being either data or computation intensive: in data intensive situations jobs may be pushed to the data and in computation intensive situations data may be pulled to the…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-07-06 Richard McClatchey, Ashiq Anjum, Heinz Stockinger, Arshad Ali, Ian Willers, Michael Thomas

Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Lauritz Thamsen, Ilya Verbitskiy, Sasho Nedelkoski, Vinh Thuy Tran, Vinicius Meyer, Miguel G. Xavier, Odej Kao, Cesar A. F. De Rose

Collaborative Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud Network

Kubernetes (k8s) has the potential to coordinate distributed edge resources and centralized cloud resources, but currently lacks a specialized scheduling framework for edge-cloud networks. Besides, the hierarchical distribution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-11 Shihao Shen, Yiwen Han, Xiaofei Wang, Shiqiang Wang, Victor C. M. Leung

Learning While Scheduling in Multi-Server Systems with Unknown Statistics: MaxWeight with Discounted UCB

Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, crowdsourcing, and healthcare systems. This paper considers a multi-server system with multiple servers and multiple types of…

Machine Learning · Computer Science 2023-06-05 Zixian Yang, R. Srikant, Lei Ying

Improving Overhead Computation and pre-processing Time for Grid Scheduling System

Computational Grids are enormous environments with heterogeneous resources and stable infrastructures among other Internet-based computing systems. However, managing resources in such systems poses its own problems. Scheduler systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-05-07 Asgarali Bouyer, Mohammad Javad Hoseyni, Abdul Hanan Abdullah

Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation

As human-robot collaboration increases in the workforce, it becomes essential for human-robot teams to coordinate efficiently and intuitively. Traditional approaches for human-robot scheduling either utilize exact methods that are…

Artificial Intelligence · Computer Science 2023-02-01 Batuhan Altundas, Zheyuan Wang, Joshua Bishop, Matthew Gombolay

Analysis of Workflow Schedulers in Simulated Distributed Environments

Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed on both HPC clusters and in the cloud. An important aspect of executing such graphs is the scheduling algorithm used.…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-18 Jakub Beránek, Stanislav Böhm, Vojtěch Cima

Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng