Related papers: DSP: Dynamic Sequence Parallelism for Multi-Dimens…

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao, Aijun An, Junfeng Liu, Bao Xin Chen

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

Slim-DP: A Light Communication Data Parallelism for DNN

Data parallelism has emerged as a necessary technique to accelerate the training of deep neural networks (DNN). In a typical data parallelism approach, the local workers push the latest updates of all the parameters to the parameter server…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-28 Shizhao Sun, Wei Chen, Jiang Bian, Xiaoguang Liu, Tie-Yan Liu

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP)…

Computation and Language · Computer Science 2026-04-30 Vasu Shyam, Anna Golubeva, Quentin Anthony

SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction

With the growing model size, deep neural networks (DNN) are increasingly trained over massive GPU accelerators, which demands a proper parallelization plan that transforms a DNN model into fine-grained tasks and then schedules them to GPUs…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-24 Zhiqi Lin, Youshan Miao, Guodong Liu, Xiaoxiang Shi, Quanlu Zhang, Fan Yang, Saeed Maleki, Yi Zhu, Xu Cao, Cheng Li, Mao Yang, Lintao Zhang, Lidong Zhou

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the…

Machine Learning · Computer Science 2024-07-03 Jiarui Fang, Shangchun Zhao

Parallelizing Optimal Multiple Sequence Alignment by Dynamic Programming

Optimal multiple sequence alignment by dynamic programming, like many highly dimensional scientific computing problems, has failed to benefit from the improvements in computing performance brought about by multi-processor systems, due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-30 Manal Helal, Hossam El-Gindy, Lenore Mullin, Bruno Gaeta

Model Parallelism With Subnetwork Data Parallelism

Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into…

Machine Learning · Computer Science 2025-10-06 Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

Sequence Parallelism: Long Sequence Training from System Perspective

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm…

Machine Learning · Computer Science 2022-05-24 Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You

Tesseract: Parallelize the Tensor Parallelism Efficiently

Together with the improvements in state-of-the-art accuracies of various tasks, deep learning models are getting significantly larger. However, it is extremely difficult to implement these large models because limited GPU memory makes it…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-02 Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You

Linear Attention Sequence Parallelism

Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take…

Machine Learning · Computer Science 2025-05-19 Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the…

Machine Learning · Computer Science 2023-10-05 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads

Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to execute additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-07 Felipe A. Quezada, Cristóbal A. Navarro, Miguel Romero, Cristhian Aguilera

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-19 Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin Cui

Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism

With the advancement of large language models (LLMs), their context windows have rapidly expanded. To meet diverse demands from varying-length requests in online services, existing state-of-the-art systems tune the sequence parallelism (SP)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-20 Cong Li, Yuzhe Yang, Xuegui Zheng, Qifan Yang, Yijin Guan, Size Zheng, Li-Wen Chang, Shufan Liu, Xin Liu, Guangyu Sun

Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks

Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory…

Machine Learning · Computer Science 2024-03-15 Louis Fournier, Edouard Oyallon

DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism

Context parallelism has emerged as a key technique to support long-context training, a growing trend in generative AI for modern large models. However, existing context parallel methods rely on static parallelization configurations that…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-14 Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, Chuan Wu