Related papers: Nesterov Method for Asynchronous Pipeline Parallel…

PipeMare: Asynchronous Pipeline Parallel DNN Training

Pipeline parallelism (PP) when training neural networks enables larger models to be partitioned spatially, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve the statistical…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-11 · Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R. Aberger, Christopher De Sa

A More Stable Accelerated Gradient Method Inspired by Continuous-Time Perspective

Nesterov's accelerated gradient method (NAG) is widely used in problems with machine learning background including deep learning, and is corresponding to a continuous-time differential equation. From this connection, the property of the…

Optimization and Control · Mathematics · 2022-04-05 · Yasong Feng, Weiguo Gao

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their…

Machine Learning · Computer Science · 2026-02-02 · Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long

A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks

Momentum methods, including heavy-ball (HB) and Nesterov's accelerated gradient (NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and…

Machine Learning · Computer Science · 2022-04-19 · Xin Liu, Wei Tao, Zhisong Pan

Nesterov Accelerated Shuffling Gradient Method for Convex Optimization

In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling…

Optimization and Control · Mathematics · 2022-06-14 · Trang H. Tran, Katya Scheinberg, Lam M. Nguyen

Provable Convergence of Nesterov's Accelerated Gradient Method for Over-Parameterized Neural Networks

Momentum methods, such as heavy ball method (HB) and Nesterov's accelerated gradient method (NAG), have been widely used in training neural networks by incorporating the history of gradients into the current updating process. In practice,…

Machine Learning · Computer Science · 2022-04-19 · Xin Liu, Zhisong Pan, Wei Tao

Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective…

Machine Learning · Statistics · 2019-10-14 · Igor Colin, Ludovic Dos Santos, Kevin Scaman

Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks

We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that…

Machine Learning · Computer Science · 2024-12-03 · Zhenghao Xu, Yuqing Wang, Tuo Zhao, Rachel Ward, Molei Tao

The Global R-linear Convergence of Nesterov's Accelerated Gradient Method with Unknown Strongly Convex Parameter

The Nesterov accelerated gradient (NAG) method is an important extrapolation-based numerical algorithm that accelerates the convergence of the gradient descent method in convex optimization. When dealing with an objective function that is…

Optimization and Control · Mathematics · 2025-05-28 · Chenglong Bao, Liang Chen, Jiahong Li

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-23 · Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

Boosting Accelerated Proximal Gradient Method with Adaptive Sampling for Stochastic Composite Optimization

We develop an adaptive Nesterov accelerated proximal gradient (adaNAPG) algorithm for stochastic composite optimization problems, boosting the Nesterov accelerated proximal gradient (NAPG) algorithm through the integration of an adaptive…

Optimization and Control · Mathematics · 2025-07-25 · Dongxuan Zhu, Weihuan Huang, Caihua Chen

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-29 · Ling Chen, Houming Wu, Wenjie Yu

Understanding Nesterov's Acceleration via Proximal Point Method

The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use the PPM method to provide conceptually simple derivations along with…

Optimization and Control · Mathematics · 2022-06-03 · Kwangjun Ahn, Suvrit Sra

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-03 · Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

Technical Report: A Totally Asynchronous Nesterov's Accelerated Gradient Method for Convex Optimization

We present a totally asynchronous algorithm for convex optimization that is based on a novel generalization of Nesterov's accelerated gradient method. This algorithm is developed for fast convergence under "total asynchrony," i.e., allowing…

Optimization and Control · Mathematics · 2024-06-17 · Ellie Pond, April Sebok, Zachary Bell, Matthew Hale

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-29 · Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

Implementation of a modified Nesterov's Accelerated quasi-Newton Method on Tensorflow

Recent studies incorporate Nesterov's accelerated gradient method for the acceleration of gradient based training. The Nesterov's Accelerated Quasi-Newton (NAQ) method has shown to drastically improve the convergence speed compared to the…

Machine Learning · Computer Science · 2020-10-16 · S. Indrapriyadarsini, Shahrzad Mahboubi, Hiroshi Ninomiya, Hideki Asai

NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer

Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In…

Optimization and Control · Mathematics · 2023-10-03 · Valentin Leplat, Daniil Merkulov, Aleksandr Katrutsa, Daniel Bershatsky, Olga Tsymboi, Ivan Oseledets

Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the…

Machine Learning · Computer Science · 2024-05-09 · Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan

First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow

A novel dynamical inertial Newton system, which is called Hessian-driven Nesterov accelerated gradient (H-NAG) flow is proposed. Convergence of the continuous trajectory are established via tailored Lyapunov function, and new first-order…

Optimization and Control · Mathematics · 2019-12-25 · Long Chen, Hao Luo