Related papers: Duration-Informed Workload Scheduler

Workload Failure Prediction for Data Centers

Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

Scheduler Technologies in Support of High Performance Data Analysis

Job schedulers are a key component of scalable computing infrastructures. They orchestrate all of the work executed on the computing infrastructure and directly impact the effectiveness of the system. Recently, job workloads have…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-06 Albert Reuther, Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Michael Jones, Peter Michaleas, Andrew Prout, Antonio Rosa, Jeremy Kepner

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun, C. Emre Koksal, Ness B. Shroff

Scalable System Scheduling for HPC and Big Data

In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-06 Albert Reuther, Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Michael Jones, Peter Michaleas, Andrew Prout, Antonio Rosa, Jeremy Kepner

Timely-Throughput Optimal Scheduling with Prediction

Motivated by the increasing importance of providing delay-guaranteed services in general computing and communication systems, and the recent wide adoption of learning and prediction in network control, in this work, we consider a general…

Networking and Internet Architecture · Computer Science 2018-01-08 Kun Chen, Longbo Huang

Deadline-Aware Joint Task Scheduling and Offloading in Mobile Edge Computing Systems

The demand for stringent interactive quality-of-service has intensified in both mobile edge computing (MEC) and cloud systems, driven by the imperative to improve user experiences. As a result, the processing of computation-intensive tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-28 Ngoc Hung Nguyen, Van-Dinh Nguyen, Anh Tuan Nguyen, Nguyen Van Thieu, Hoang Nam Nguyen, Symeon Chatzinotas

Efficient Instruction Scheduling using Real-time Load Delay Tracking

Many hardware structures in today's high-performance out-of-order processors do not scale in an efficient way. To address this, different solutions have been proposed that build execution schedules in an energy-efficient manner. Issue time…

Hardware Architecture · Computer Science 2021-09-08 Andreas Diavastos, Trevor E. Carlson

Helping HPC Users Specify Job Memory Requirements via Machine Learning

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulty estimating the number of processors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-10 Eduardo R. Rodrigues, Renato L. F. Cunha, Marco A. S. Netto, Michael Spriggs

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing

Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-08 Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

Scheduling Real-time Deep Learning Services as Imprecise Computations

The paper presents an efficient real-time scheduling algorithm for intelligent real-time edge services, defined as those that perform machine intelligence tasks, such as voice recognition, LIDAR processing, or machine vision, on behalf of…

Machine Learning · Computer Science 2020-11-03 Shuochao Yao, Yifan Hao, Yiran Zhao, Huajie Shao, Dongxin Liu, Shengzhong Liu, Tianshi Wang, Jinyang Li, Tarek Abdelzaher

Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters

We present a scheduler that improves cluster utilization and job completion times by packing tasks having multi-resource requirements and inter-dependencies. While the problem is algorithmically very hard, we achieve near-optimality on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-04-26 Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni

Scheduling to Optimize Sojourn Time of Successful Jobs

Deep neural network training jobs and other iterative computations frequently include checkpoints where jobs can be canceled based on the current value of monitored metrics. While most existing results focus on the performance of all…

Performance · Computer Science 2022-09-30 Yuan Yao, Marco Paolieri, Leana Golubchik

A Workflow-Forecast Approach To The Task Scheduling Problem In Distributed Computing Systems

The aim of this paper is to describe a deep-learning-based scheduling approach for academic-purpose high-performance computing systems. The share of academic-purpose distributed computing systems (DCS) reaches 17.4 percent…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-08 Andrey Gritsenko

A Simulator for Data-Intensive Job Scheduling

Despite the fact that size-based schedulers can give excellent results in terms of both average response times and fairness, data-intensive computing execution engines generally do not employ size-based schedulers, mainly because of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-08-22 Matteo Dell'Amico

Learning While Scheduling in Multi-Server Systems with Unknown Statistics: MaxWeight with Discounted UCB

Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, crowdsourcing, and healthcare systems. This paper considers a multi-server system with multiple servers and multiple types of…

Machine Learning · Computer Science 2023-06-05 Zixian Yang, R. Srikant, Lei Ying

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL…

Machine Learning · Computer Science 2019-09-16 Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning

Today, high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-03 Di Zhang, Dong Dai, Youbiao He, Forrest Sheng Bao, Bing Xie

Sequence-to-sequence models for workload interference

Co-scheduling of jobs in data centers is a challenging scenario, where jobs can compete for resources, leading to severe slowdowns or failed executions. Efficient job placement in environments where resources are shared requires awareness…

Machine Learning · Computer Science 2020-07-07 David Buchaca Prats, Joan Marcual, Josep Lluís Berral, David Carrera

Node-Based Job Scheduling for Large Scale Simulations of Short Running Jobs

Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development require a high-performance scheduler. This paper presents a novel node-based scheduling approach for large scale simulations…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-13 Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Siddharth Samsi, Charles Yee, Jeremy Kepner