Related papers: A Model Parallel Proximal Stochastic Gradient Algo…

Asynchronous Stochastic Proximal Methods for Nonconvex Nonsmooth Optimization

We study stochastic algorithms for solving nonconvex optimization problems with a convex yet possibly nonsmooth regularizer, which arise in many practical machine learning applications. However, compared to asynchronous…

Machine Learning · Computer Science 2018-09-18 Rui Zhu, Di Niu, Zongpeng Li

Second-Order Convergence of Asynchronous Parallel Stochastic Gradient Descent: When Is the Linear Speedup Achieved?

In machine learning, asynchronous parallel stochastic gradient descent (APSGD) is broadly used to speed up the training process with multiple workers. Meanwhile, the time delay of stale gradients in asynchronous algorithms is generally…

Machine Learning · Computer Science 2020-06-09 Lifu Wang, Bo Shen, Ning Zhao

Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms

The implementation of a vast majority of machine learning (ML) algorithms boils down to solving a numerical optimization problem. In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-06 Janis Keuper, Franz-Josef Pfreundt

Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent

Asynchronous parallel optimization algorithms for solving large-scale machine learning problems have drawn significant attention from academia to industry recently. This paper proposes a novel algorithm, decoupled asynchronous proximal…

Optimization and Control · Mathematics 2016-05-24 Yitan Li, Linli Xu, Xiaowei Zhong, Qing Ling

Asynchronous Decentralized Parallel Stochastic Gradient Descent

Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a…

Optimization and Control · Mathematics 2018-09-26 Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads…

Machine Learning · Computer Science 2025-02-10 Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney

Asynchronous Stochastic Proximal Optimization Algorithms with Variance Reduction

Regularized empirical risk minimization (R-ERM) is an important branch of machine learning, since it constrains the capacity of the hypothesis space and guarantees the generalization ability of the learning algorithm. Two classic proximal…

Machine Learning · Computer Science 2016-09-28 Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhi-Ming Ma, Tie-Yan Liu

Fast Asynchronous Parallel Stochastic Gradient Decent

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD…

Machine Learning · Statistics 2015-08-25 Shen-Yi Zhao, Wu-Jun Li

Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven development…

Machine Learning · Computer Science 2025-09-18 Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Philipp A. Witte, Russell J. Hewett, Yin Tat Lee

Guided parallelized stochastic gradient descent for delay compensation

Stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models. However, with the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its…

Machine Learning · Computer Science 2024-02-13 Anuraganand Sharma

Parallel and distributed asynchronous adaptive stochastic gradient methods

Stochastic gradient methods (SGMs) are the predominant approaches to train deep learning models. The adaptive versions (e.g., Adam and AMSGrad) have been extensively used in practice, partly because they achieve faster convergence than the…

Optimization and Control · Mathematics 2022-04-14 Yangyang Xu, Yibo Xu, Yonggui Yan, Colin Sutcher-Shepard, Leopold Grinberg, Jie Chen

Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used to train deep neural networks and have achieved many successes in practice recently. However, existing theories cannot explain their convergence and…

Optimization and Control · Mathematics 2019-04-22 Xiangru Lian, Yijun Huang, Yuncheng Li, Ji Liu

MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is very useful in optimization problems with high-dimensional non-convex target functions, and hence constitutes an important component of several Machine Learning and Data Analytics methods. Recently there…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-11 Karl Bäckström, Marina Papatriantafilou, Philippas Tsigas

Parallel Stochastic Gradient Descent with Sound Combiners

Stochastic gradient descent (SGD) is a well-known method for regression and classification tasks. However, it is an inherently sequential algorithm: at each step, the processing of the current example depends on the parameters learned from…

Machine Learning · Computer Science 2017-05-24 Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz

Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on…

Machine Learning · Computer Science 2013-03-28 Tom Schaul, Yann LeCun

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-02 Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang

Asynchronous Parallel Stochastic Quasi-Newton Methods

Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force to scale up machine learning models, such as deep neural nets, the second-order quasi-Newton methods start to draw attention due to…

Optimization and Control · Mathematics 2020-11-03 Qianqian Tong, Guannan Liang, Xingyu Cai, Chunjiang Zhu, Jinbo Bi

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

As the size of models and datasets grows, it has become increasingly common to train models in parallel. However, existing distributed stochastic gradient descent (SGD) algorithms suffer from insufficient utilization of computational…

Machine Learning · Computer Science 2023-08-30 Xin Zhou, Ling Chen, Houming Wu

Online Learning Under A Separable Stochastic Approximation Framework

We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework. The essence of our idea lies in the observation that certain parameters in the models are easier to…

Machine Learning · Computer Science 2023-05-23 Min Gan, Xiang-xiang Su, Guang-yong Chen, Jing Chen

Decoupled Asynchronous Proximal Stochastic Gradient Descent with Variance Reduction

In the era of big data, optimizing large scale machine learning problems becomes a challenging task and draws significant attention. Asynchronous optimization algorithms come out as a promising solution. Recently, decoupled asynchronous…

Machine Learning · Computer Science 2016-09-30 Zhouyuan Huo, Bin Gu, Heng Huang