Related papers: Structure-Aware Dynamic Scheduler for Parallel Mac…

Primitives for Dynamic Big Model Parallelism

When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A…

Machine Learning · Statistics 2014-06-19 Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, Eric P. Xing

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-15 Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-14 Wei Da, Evangelia Kalyvianaki

Analysis of Dynamic Scheduling Strategies for Matrix Multiplication on Heterogeneous Platforms

The tremendous increase in the size and heterogeneity of supercomputers makes it very difficult to predict the performance of a scheduling algorithm. Therefore, dynamic solutions, where scheduling decisions are made at runtime have…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-16 Olivier Beaumont, Loris Marchal

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in…

Computation and Language · Computer Science 2026-03-17 Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

Strategies and Principles of Distributed Machine Learning on Big Data

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics…

Machine Learning · Statistics 2016-01-01 Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai

Petuum: A New Platform for Distributed Machine Learning on Big Data

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…

Machine Learning · Statistics 2015-05-18 Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao, Yanghua Peng, Chuan Wu, Zongpeng Li

CADS: Core-Aware Dynamic Scheduler for Multicore Memory Controllers

Memory controller scheduling is crucial in multicore processors, where DRAM bandwidth is shared. Since increased number of requests from multiple cores of processors becomes a source of bottleneck, scheduling the requests efficiently is…

Hardware Architecture · Computer Science 2019-07-19 Eduardo Olmedo Sanchez, Xian-He Sun

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-16 Menglu Yu, Jia Liu, Chuan Wu, Bo Ji, Elizabeth S. Bentley

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training. Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-04 Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella

Scaling Distributed Machine Learning with In-Network Aggregation

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency

Multimodal Large Models (MLLMs) have achieved remarkable progress in vision-language understanding and generation tasks. However, existing MLLMs typically rely on static modality fusion strategies, which treat all modalities equally…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Hiroshi Tanaka, Anika Rao, Hana Satou, Michael Johnson, Sofia García

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-28 Xiaoyang Zhao, Chuan Wu

HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring

While distributed training significantly speeds up the training process of the deep neural network (DNN), the utilization of the cluster is relatively low due to the time-consuming data synchronizing between workers. To alleviate this…

Machine Learning · Computer Science 2020-12-01 Yuhao Zhou, Qing Ye, Hailun Zhang, Jiancheng Lv

Blox: A Modular Toolkit for Deep Learning Schedulers

Deep Learning (DL) workloads have rapidly increased in popularity in enterprise clusters and several new cluster schedulers have been proposed in recent years to support these workloads. With rapidly evolving DL workloads, it is challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-21 Saurabh Agarwal, Amar Phanishayee, Shivaram Venkataraman

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science 2025-04-15 Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn