Related papers: SLAQ: Quality-Driven Scheduling for Distributed Ma…

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Menglu Yu , Jia Liu , Chuan Wu , Bo Ji , Elizabeth S. Bentley

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

More and more companies have deployed machine learning (ML) clusters, where deep learning (DL) models are trained for providing various AI-driven services. Efficient resource scheduling is essential for maximal utilization of expensive DL…

Machine Learning · Computer Science 2019-09-16 Yanghua Peng , Yixin Bao , Yangrui Chen , Chuan Wu , Chen Meng , Wei Lin

Structure-Aware Dynamic Scheduler for Parallel Machine Learning

Training large machine learning (ML) models with many variables or parameters can take a long time if one employs sequential procedures even with stochastic updates. A natural solution is to turn to distributed computing on a cluster;…

Machine Learning · Statistics 2013-12-31 Seunghak Lee , Jin Kyu Kim , Qirong Ho , Garth A. Gibson , Eric P. Xing

A Survey on Large-scale Machine Learning

Machine learning can provide deep insights into data, allowing machines to make high-quality predictions and having been widely used in real-world applications, such as text mining, visual classification, and recommender systems. However,…

Machine Learning · Computer Science 2020-08-11 Meng Wang , Weijie Fu , Xiangnan He , Shijie Hao , Xindong Wu

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina , Susmit Shannigrahi

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often…

Machine Learning · Computer Science 2026-03-25 Yiqi Zhang , Huiqiang Jiang , Xufang Luo , Zhihe Yang , Chengruidong Zhang , Yifei Shen , Dongsheng Li , Yuqing Yang , Lili Qiu , Yang You

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian , Zulfiya R. Gabidullina , Hamid R. Rabiee

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-28 Xiaoyang Zhao , Chuan Wu

On Delay-Optimal Scheduling in Queueing Systems with Replications

In modern computer systems, jobs are divided into short tasks and executed in parallel. Empirical observations in practical systems suggest that the task service times are highly random and the job service time is bottlenecked by the…

Performance · Computer Science 2017-02-08 Yin Sun , C. Emre Koksal , Ness B. Shroff

Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

Distributed training increases the number of batches processed per iteration either by scaling-out (adding more nodes) or scaling-up (increasing the batch-size). However, the largest configuration does not necessarily yield the best…

Machine Learning · Computer Science 2026-03-20 Sahil Tyagi , Feiyi Wang

LLMs can Schedule

The job shop scheduling problem (JSSP) remains a significant hurdle in optimizing production processes. This challenge involves efficiently allocating jobs to a limited number of machines while minimizing factors like total processing time…

Artificial Intelligence · Computer Science 2024-08-14 Henrik Abgaryan , Ararat Harutyunyan , Tristan Cazenave

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-27 Aurick Qiao , Sang Keun Choe , Suhas Jayaram Subramanya , Willie Neiswanger , Qirong Ho , Hao Zhang , Gregory R. Ganger , Eric P. Xing

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-13 Feng Liang , Zhen Zhang , Haifeng Lu , Chengming Li , Victor C. M. Leung , Yanyi Guo , Xiping Hu

SLAW: Scaled Loss Approximate Weighting for Efficient Multi-Task Learning

Multi-task learning (MTL) is a subfield of machine learning with important applications, but the multi-objective nature of optimization in MTL leads to difficulties in balancing training between tasks. The best MTL optimization methods…

Machine Learning · Computer Science 2021-09-20 Michael Crawshaw , Jana Košecká

Dependable Distributed Training of Compressed Machine Learning Models

The existing work on the distributed training of machine learning (ML) models has consistently overlooked the distribution of the achieved learning quality, focusing instead on its average value. This leads to a poor dependability}of the…

Machine Learning · Computer Science 2024-02-23 Francesco Malandrino , Giuseppe Di Giacomo , Marco Levorato , Carla Fabiana Chiasserini

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. Since the request generation length is generally unpredictable, it is difficult to estimate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-11 Ke Cheng , Wen Hu , Zhi Wang , Hongen Peng , Jianguo Li , Sheng Zhang

Network-accelerated Distributed Machine Learning Using MLFabric

Existing distributed machine learning (DML) systems focus on improving the computational efficiency of distributed learning, whereas communication aspects have received less attention. Many DML systems treat the network as a blackbox. Thus,…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-02 Raajay Viswanathan , Aditya Akella

Learning the Quality of Machine Permutations in Job Shop Scheduling

In recent years, the power demonstrated by Machine Learning (ML) has increasingly attracted the interest of the optimization community that is starting to leverage ML for enhancing and automating the design of algorithms. One combinatorial…

Machine Learning · Computer Science 2022-09-19 Andrea Corsini , Simone Calderara , Mauro Dell'Amico

Strategies and Principles of Distributed Machine Learning on Big Data

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics…

Machine Learning · Statistics 2016-01-01 Eric P. Xing , Qirong Ho , Pengtao Xie , Wei Dai