Related papers: RTP: Rethinking Tensor Parallelism with Memory Ded…

ATP: Adaptive Tensor Parallelism for Foundation Models

Foundation models have impressive performance and generalization capabilities across a wide range of applications. The increasing size of the models introduces great challenges for the training. Tensor parallelism is a critical technique…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-23 Shenggan Cheng, Ziming Liu, Jiangsu Du, Yang You

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP)…

Computation and Language · Computer Science 2026-04-30 Vasu Shyam, Anna Golubeva, Quentin Anthony

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-22 Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin Cui

ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology

Large-scale models rely heavily on 3D parallelism for distributed training, which utilizes tensor parallelism (TP) as the intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-27 Ding Tang, Lijuan Jiang, Jiecheng Zhou, Minxi Jin, Hengjie Li, Xingcheng Zhang, Zhilin Pei, Jidong Zhai

RPC Considered Harmful: Fast Distributed Deep Learning on RDMA

Deep learning emerges as an important new resource-intensive workload and has been successfully applied in computer vision, speech, natural language processing, and so on. Distributed deep learning is becoming a necessity to cope with…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-23 Jilong Xue, Youshan Miao, Cheng Chen, Ming Wu, Lintao Zhang, Lidong Zhou

Learning in the Machine: Random Backpropagation and the Deep Learning Channel

Random backpropagation (RBP) is a variant of the backpropagation algorithm for training neural networks, where the transpose of the forward matrices are replaced by fixed random matrices in the calculation of the weight updates. It is…

Machine Learning · Computer Science 2017-12-25 Pierre Baldi, Peter Sadowski, Zhiqin Lu

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Tensor parallelism is an essential technique for distributed training of large neural networks. However, automatically determining an optimal tensor parallel strategy is challenging due to the gigantic search space, which grows…

Machine Learning · Computer Science 2025-08-06 Ziji Shi, Le Jiang, Ang Wang, Jie Zhang, Chencan Wu, Yong Li, Xiaokui Xiao, Wei Lin, Jialin Li

Efficient Real Time Recurrent Learning through combined activity and parameter sparsity

Backpropagation through time (BPTT) is the standard algorithm for training recurrent neural networks (RNNs), which requires separate simulation phases for the forward and backward passes for inference and learning, respectively. Moreover,…

Machine Learning · Computer Science 2023-03-13 Anand Subramoney

On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers

Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. The deployment of Transformer architectures is significantly hindered by their extensive computational and memory requirements,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-03 Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li

RPN: A Residual Pooling Network for Efficient Federated Learning

Federated learning is a distributed machine learning framework which enables different parties to collaboratively train a model while protecting data privacy and security. Due to model complexity, network unreliability and connection…

Machine Learning · Computer Science 2020-04-08 Anbu Huang, Yuanyuan Chen, Yang Liu, Tianjian Chen, Qiang Yang

A Fully Tensorized Recurrent Neural Network

Recurrent neural networks (RNNs) are powerful tools for sequential modeling, but typically require significant overparameterization and regularization to achieve optimal performance. This leads to difficulties in the deployment of large…

Machine Learning · Computer Science 2021-11-11 Charles C. Onu, Jacob E. Miller, Doina Precup

TNT: Improving Chunkwise Training for Test-Time Memorization

Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak…

Machine Learning · Computer Science 2025-11-11 Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

Shared Memory Parallelization of MTTKRP for Dense Tensors

The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-31 Koby Hayashi, Grey Ballard, Jeffrey Jiang, Michael Tobia

DawnPiper: A Memory-scablable Pipeline Parallel Training Framework

Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Xuan Peng, Xuanhua Shi, Haolin Zhang, Yunfei Zhao, Xuehai Qian

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism

A good parallelization strategy can significantly improve the efficiency or reduce the cost for the distributed training of deep neural networks (DNNs). Recently, several methods have been proposed to find efficient parallelization…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-12 Zhenkun Cai, Kaihao Ma, Xiao Yan, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, Fan Yu

Parallel Algorithms for Tensor Train Arithmetic

We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…

Numerical Analysis · Mathematics 2021-09-08 Hussam Al Daas, Grey Ballard, Peter Benner

Model Parallelism With Subnetwork Data Parallelism

Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into…

Machine Learning · Computer Science 2025-10-06 Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky

Tensor-Train Recurrent Neural Networks for Video Classification

The Recurrent Neural Networks and their variants have shown promising performances in sequence modeling tasks such as Natural Language Processing. These models, however, turn out to be impractical and difficult to train when exposed to very…

Computer Vision and Pattern Recognition · Computer Science 2017-07-07 Yinchong Yang, Denis Krompass, Volker Tresp