Related papers: Scheduling optimization of parallel linear algebra…

A Scalable Shared-Memory Parallel Simplex for Large-Scale Linear Programming

The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-29 Demetrios Coutinho , Felipe O. Lins e Silva , Daniel Aloise , Samuel , Xavier-de-Souza

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-10 Peng Zhang , Jianbin Fang , Canqun Yang , Chun Huang , Tao Tang , Zheng Wang

Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK,…

Performance · Computer Science 2014-10-01 Elmar Peise , Paolo Bientinesi

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these…

Mathematical Software · Computer Science 2008-06-12 Alfredo Buttari , Julien Langou , Jakub Kurzak , Jack Dongarra

Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference

Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure…

Machine Learning · Computer Science 2025-09-03 Ruokai Yin , Sattwik Deb Mishra , Xuan Zuo , Hokchhay Tann , Preyas Shah , Apala Guha

A load balancing strategy for parallel computation of sparse permanents

The research in parallel machine scheduling in combinatorial optimization suggests that the desirable parallel efficiency could be achieved when the jobs are sorted in the non-increasing order of processing times. In this paper, we find…

Numerical Analysis · Mathematics 2012-02-15 Lei Wang , Heng Liang , Fengshan Bai , Yan Huo

Optimal Divisible Load Scheduling for Resource-Sharing Network

Scheduling is an important task allowing parallel systems to perform efficiently and reliably. For modern computation systems, divisible load is a special type of data which can be divided into arbitrary sizes and independently processed in…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-07 Fei Wu , Yang Cao , Thomas Robertazzi

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g.,…

Machine Learning · Computer Science 2018-06-12 Zhihao Jia , Sina Lin , Charles R. Qi , Alex Aiken

Parallel training of linear models without compromising convergence

In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks,…

Machine Learning · Computer Science 2018-12-20 Nikolas Ioannou , Celestine Dünner , Kornilios Kourtis , Thomas Parnell

Automatically Planning Optimal Parallel Strategy for Large Language Models

The number of parameters in large-scale language models based on transformers is gradually increasing, and the scale of computing clusters is also growing. The technology of quickly mobilizing large amounts of computing resources for…

Artificial Intelligence · Computer Science 2025-01-03 Zongbiao Li , Xiezhao Li , Yinghao Cui , Yijun Chen , Zhixuan Gu , Yuxuan Liu , Wenbo Zhu , Fei Jia , Ke Liu , Qifeng Li , Junyao Zhan , Jiangtao Zhou , Chenxi Zhang , Qike Liu

Optimizing Distributed Training Approaches for Scaling Neural Networks

This paper presents a comparative analysis of distributed training strategies for large-scale neural networks, focusing on data parallelism, model parallelism, and hybrid approaches. We evaluate these strategies on image classification…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-01 Vishnu Vardhan Baligodugula , Fathi Amsaad

Online Distributed Scheduling on a Fault-prone Parallel System

We consider a parallel system of $m$ identical machines prone to unpredictable crashes and restarts, trying to cope with the continuous arrival of tasks to be executed. Tasks have different computational requirements (i.e., processing time…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-21 Elli Zavou , Antonio Fernández Anta

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann , Samuel Weinbach , Konstantin Dobler , Maximilian Schall , Gerard de Melo

Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine…

Machine Learning · Computer Science 2025-10-30 Mohammadreza Doostmohammadian , Zulfiya R. Gabidullina , Hamid R. Rabiee

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference.…

Data Structures and Algorithms · Computer Science 2020-06-26 Guy E. Blelloch , Jeremy T. Fineman , Yan Gu , Yihan Sun

Scheduling Trees of Malleable Tasks for Sparse Linear Algebra

Scientific workloads are often described as directed acyclic task graphs. In this paper, we focus on the multifrontal factorization of sparse matrices, whose task graph is structured as a tree of parallel tasks. Among the existing models…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-05 Abdou Guermouche , Loris Marchal , Bertrand Simon , Frédéric Vivien

High-Performance Computing for Scheduling Decision Support: A Parallel Depth-First Search Heuristic

Many academic disciplines - including information systems, computer science, and operations management - face scheduling problems as important decision making tasks. Since many scheduling problems are NP-hard in the strong sense, there is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-26 Gerhard Rauchecker , Guido Schryen

Efficient allocation of image recognition and LLM tasks on multi-GPU system

This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning model in image recognition, various parallelization…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Marcin Lawenda , Krzesimir Samborski , Kyrylo Khloponin , Łukasz Szustak

Exploring Parallelism in Learning Belief Networks

It has been shown that a class of probabilistic domain models cannot be learned correctly by several existing algorithms which employ a single-link look ahead search. When a multi-link look ahead search is used, the computational complexity…

Artificial Intelligence · Computer Science 2013-02-08 TongSheng Chu , Yang Xiang