Related papers: Parle: parallelizing stochastic gradient descent

Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks

The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, Byung-Gon Chun

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-07 Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in…

Machine Learning · Computer Science 2025-06-03 Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-30 Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov

Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven development…

Machine Learning · Computer Science 2025-09-18 Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Philipp A. Witte, Russell J. Hewett, Yin Tat Lee

SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Jin Lee, Zhonghao Chen, Xuhang He, Robert Underwood, Bogdan Nicolae, Franck Cappello, Xiaoyi Lu, Sheng Di, Zheng Zhang

EventGraD: Event-Triggered Communication in Parallel Machine Learning

Communication in parallel systems imposes significant overhead which often turns out to be a bottleneck in parallel machine learning. To relieve some of this overhead, in this paper, we present EventGraD - an algorithm with event-triggered…

Machine Learning · Computer Science 2021-12-10 Soumyadip Ghosh, Bernardo Aquino, Vijay Gupta

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao, Aijun An, Junfeng Liu, Bao Xin Chen

Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradient in parallel, aggregates all…

Optimization and Control · Mathematics 2018-11-19 Hao Yu, Sen Yang, Shenghuo Zhu

Integrated Model, Batch and Domain Parallelism in Training Neural Networks

We propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD). Our goal is to…

Machine Learning · Computer Science 2018-05-17 Amir Gholami, Ariful Azad, Peter Jin, Kurt Keutzer, Aydin Buluc

Parallel Training of Deep Networks with Local Updates

Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times…

Machine Learning · Computer Science 2021-06-16 Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS).…

Machine Learning · Computer Science 2016-11-15 Peter H. Jin, Qiaochu Yuan, Forrest Iandola, Kurt Keutzer

Gear Training: A new way to implement high-performance model-parallel training

The training of Deep Neural Networks usually needs tremendous computing resources. Therefore, many deep models are trained on large clusters instead of on a single machine or GPU. Though most research at present tries to run the whole model on all…

Machine Learning · Computer Science 2018-06-12 Hao Dong, Shuai Li, Dongchang Xu, Yi Ren, Di Zhang

Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e.…

Machine Learning · Computer Science 2023-05-16 Siddharth Singh, Abhinav Bhatele

A Practical Layer-Parallel Training Algorithm for Residual Networks

Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters, which are time-consuming for deep ResNets. To break the…

Machine Learning · Computer Science 2021-02-19 Qi Sun, Hexin Dong, Zewei Chen, Weizhen Dian, Jiacheng Sun, Yitong Sun, Zhenguo Li, Bin Dong

Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks

Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory…

Machine Learning · Computer Science 2024-03-15 Louis Fournier, Edouard Oyallon

Sparse Communication for Training Deep Networks

Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with others and updates the parameters using the…

Machine Learning · Computer Science 2020-09-22 Negar Foroutan Eghlidi, Martin Jaggi

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava, Santanu Chaudhury, Dr. Jayadeva

DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training

Data parallelism has become the de facto standard for training Deep Neural Networks on multiple processing units. In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated…

Machine Learning · Computer Science 2019-11-07 Alessandro Rigazzi