Related papers: Angle based dynamic learning rate for gradient des…
We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks. While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent.…
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by…
Any gradient descent optimization requires to choose a learning rate. With deeper and deeper models, tuning that learning rate can easily become tedious and does not necessarily lead to an ideal convergence. We propose a variation of the…
Optimizing deep neural networks is largely thought to be an empirical process, requiring manual tuning of several hyper-parameters, such as learning rate, weight decay, and dropout rate. Arguably, the learning rate is the most important of…
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the…
The performance of gradient-based optimization methods, such as standard gradient descent (GD), greatly depends on the choice of learning rate. However, it can require a non-trivial amount of user tuning effort to select an appropriate…
Learning rate adaptation is a popular topic in machine learning. Gradient Descent trains neural nerwork with a fixed learning rate. Learning rate adaptation is proposed to accelerate the training process through adjusting the step size in…
Stochastic gradient algorithms are the main focus of large-scale optimization problems and led to important successes in the recent advancement of the deep learning algorithms. The convergence of SGD depends on the careful choice of…
In this paper, we describe a new way to get convergence rates for optimal methods in smooth (strongly) convex optimization tasks. Our approach is based on results for tasks where gradients have nonrandom small noises. Unlike previous…
In this paper, a new gradient-based optimization approach by automatically adjusting the learning rate is proposed. This approach can be applied to design non-adaptive learning rate and adaptive learning rate. Firstly, I will introduce the…
In this paper, we aim at providing an introduction to the gradient descent based optimization algorithms for learning deep neural network models. Deep learning models involving multiple nonlinear projection layers are very challenging to…
Tree-based models are widely recognized for their interpretability and have proven effective in various application domains, particularly in high-stakes domains. However, learning decision trees (DTs) poses a significant challenge due to…
Adaptive gradient methods, especially Adam-type methods (such as Adam, AMSGrad, and AdaBound), have been proposed to speed up the training process with an element-wise scaling term on learning rates. However, they often generalize poorly…
As deep learning models and datasets rapidly scale up, network training is extremely time-consuming and resource-costly. Instead of training on the entire dataset, learning with a small synthetic dataset becomes an efficient solution.…
We introduce a novel algorithm for gradient-based optimization of stochastic objective functions. The method may be seen as a variant of SGD with momentum equipped with an adaptive learning rate automatically adjusted by an 'energy'…
In this paper, we propose a new way to obtain optimal convergence rates for smooth stochastic (strong) convex optimization tasks. Our approach is based on results for optimization tasks where gradients have nonrandom noise. In contrast to…
Natural gradient descent is a principled method for adapting the parameters of a statistical model on-line using an underlying Riemannian parameter space to redefine the direction of steepest descent. The algorithm is examined via methods…
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient…
In this work, we propose an optimization algorithm which we call norm-adapted gradient descent. This algorithm is similar to other gradient-based optimization algorithms like Adam or Adagrad in that it adapts the learning rate of stochastic…
Decision Trees (DTs) are commonly used for many machine learning tasks due to their high degree of interpretability. However, learning a DT from data is a difficult optimization problem, as it is non-convex and non-differentiable.…