Related papers: A Model Parallel Proximal Stochastic Gradient Algo…
We study stochastic algorithms for solving nonconvex optimization problems with a convex yet possibly nonsmooth regularizer, which find wide applications in many practical machine learning applications. However, compared to asynchronous…
In machine learning, asynchronous parallel stochastic gradient descent (APSGD) is broadly used to speed up the training process through multi-workers. Meanwhile, the time delay of stale gradients in asynchronous algorithms is generally…
The implementation of a vast majority of machine learning (ML) algorithms boils down to solving a numerical optimization problem. In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in…
Asynchronous parallel optimization algorithms for solving large-scale machine learning problems have drawn significant attention from academia to industry recently. This paper proposes a novel algorithm, decoupled asynchronous proximal…
Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a…
The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads…
Regularized empirical risk minimization (R-ERM) is an important branch of machine learning, since it constrains the capacity of the hypothesis space and guarantees the generalization ability of the learning algorithm. Two classic proximal…
Stochastic gradient descent~(SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD…
Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven developtment…
Stochastic gradient descent (SGD) algorithm and its variations have been effectively used to optimize neural network models. However, with the rapid growth of big data and deep learning, SGD is no longer the most suitable choice due to its…
Stochastic gradient methods (SGMs) are the predominant approaches to train deep learning models. The adaptive versions (e.g., Adam and AMSGrad) have been extensively used in practice, partly because they achieve faster convergence than the…
Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in solving deep neural network and received many successes in practice recently. However, existing theories cannot explain their convergence and…
Stochastic Gradient Descent (SGD) is very useful in optimization problems with high-dimensional non-convex target functions, and hence constitutes an important component of several Machine Learning and Data Analytics methods. Recently there…
Stochastic gradient descent (SGD) is a well known method for regression and classification tasks. However, it is an inherently sequential algorithm at each step, the processing of the current example depends on the parameters learned from…
Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on…
The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back…
Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force to scale up machine learning models, such as deep neural nets, the second-order quasi-Newton methods start to draw attention due to…
As the size of models and datasets grows, it has become increasingly common to train models in parallel. However, existing distributed stochastic gradient descent (SGD) algorithms suffer from insufficient utilization of computational…
We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework. The essence of our idea lies in the observation that certain parameters in the models are easier to…
In the era of big data, optimizing large scale machine learning problems becomes a challenging task and draws significant attention. Asynchronous optimization algorithms come out as a promising solution. Recently, decoupled asynchronous…