Related papers: Automatic Operator-level Parallelism Planning for …

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-19 Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin Cui

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-22 Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin Cui

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide

With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive…

Machine Learning · Computer Science 2026-02-11 Hossam Amer, Rezaul Karim, Ali Pourranjbar, Weiwei Zhang, Walid Ahmed, Boxing Chen

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

A Linear Algebraic Approach to Model Parallelism in Deep Learning

Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity. Local memory and processing limitations require robust data and model parallelism for crossing…

Machine Learning · Computer Science 2020-06-08 Russell J. Hewett, Thomas J. Grady

Multi-Resource Parallel Query Scheduling and Optimization

Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…

Databases · Computer Science 2014-04-01 Minos Garofalakis, Yannis Ioannidis

Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization

Hybrid parallelism techniques are essential for efficiently training large language models (LLMs). Nevertheless, current automatic parallel planning frameworks often overlook the simultaneous consideration of node heterogeneity and dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Ruilong Wu, Xinjiao Li, Yisu Wang, Xinyu Chen, Dirk Kutscher

Distributed Mixed-Integer Linear Programming via Cut Generation and Constraint Exchange

Many problems of interest for cyber-physical network systems can be formulated as Mixed-Integer Linear Programs in which the constraints are distributed among the agents. In this paper we propose a distributed algorithmic framework to solve…

Optimization and Control · Mathematics 2019-06-05 Andrea Testa, Alessandro Rucco, Giuseppe Notarstefano

Parallelizing Optimal Multiple Sequence Alignment by Dynamic Programming

Optimal multiple sequence alignment by dynamic programming, like many highly dimensional scientific computing problems, has failed to benefit from the improvements in computing performance brought about by multi-processor systems, due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-30 Manal Helal, Hossam El-Gindy, Lenore Mullin, Bruno Gaeta

Automatically Planning Optimal Parallel Strategy for Large Language Models

The number of parameters in large-scale language models based on transformers is gradually increasing, and the scale of computing clusters is also growing. The technology of quickly mobilizing large amounts of computing resources for…

Artificial Intelligence · Computer Science 2025-01-03 Zongbiao Li, Xiezhao Li, Yinghao Cui, Yijun Chen, Zhixuan Gu, Yuxuan Liu, Wenbo Zhu, Fei Jia, Ke Liu, Qifeng Li, Junyao Zhan, Jiangtao Zhou, Chenxi Zhang, Qike Liu

Parallelizing Query Optimization on Shared-Nothing Architectures

Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query…

Databases · Computer Science 2015-11-06 Immanuel Trummer, Christoph Koch

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we…

Machine Learning · Computer Science 2018-09-18 Tal Ben-Nun, Torsten Hoefler

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Parallel Training of Deep Networks with Local Updates

Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times…

Machine Learning · Computer Science 2021-06-16 Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao, Yanghua Peng, Chuan Wu, Zongpeng Li

On the Design and Analysis of Parallel and Distributed Algorithms

Arrival of multicore systems has enforced a new scenario in computing, the parallel and distributed algorithms are fast replacing the older sequential algorithms, with many challenges of these techniques. The distributed algorithms provide…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Rajendra Purohit, K R Chowdhary, S D Purohit

DuctTeip: An efficient programming model for distributed task based parallel computing

Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-14 Afshin Zafari, Elisabeth Larsson, Martin Tillenius

Multiprocessor Scheduling with Memory Constraints: Fundamental Properties and Finding Optimal Solutions

We study the problem of scheduling a general computational DAG on multiple processors in a 2-level memory hierarchy. This setting is a natural generalization of several prominent models in the literature, and it simultaneously captures…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-24 Pál András Papp, Toni Böhnlein, A. N. Yzelman