Related papers: Minibatching Offers Improved Generalization Perfor…

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Large-scale distributed training of deep neural networks suffer from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size…

Machine Learning · Computer Science 2019-04-02 Kazuki Osawa , Yohei Tsuji , Yuichiro Ueno , Akira Naruse , Rio Yokota , Satoshi Matsuoka

Revisiting Small Batch Training for Deep Neural Networks

Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide…

Machine Learning · Computer Science 2018-04-23 Dominic Masters , Carlo Luschi

Rolling the dice for better deep learning performance: A study of randomness techniques in deep neural networks

This paper investigates how various randomization techniques impact Deep Neural Networks (DNNs). Randomization, like weight noise and dropout, aids in reducing overfitting and enhancing generalization, but their interactions are poorly…

Machine Learning · Computer Science 2024-04-08 Mohammed Ghaith Altarabichi , Sławomir Nowaczyk , Sepideh Pashami , Peyman Sheikholharam Mashhadi , Julia Handl

Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity

The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic…

Machine Learning · Computer Science 2025-08-08 Hikaru Umeda , Hideaki Iiduka

Second-Order Stochastic Optimization for Machine Learning in Linear Time

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored…

Machine Learning · Statistics 2017-12-01 Naman Agarwal , Brian Bullins , Elad Hazan

On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width

Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific…

Machine Learning · Computer Science 2024-06-11 Satoki Ishikawa , Ryo Karakida

Efficient Neural Network Training via Subset Pretraining

In training neural networks, it is common practice to use partial gradients computed over batches, mostly very small subsets of the training set. This approach is motivated by the argument that such a partial gradient is close to the true…

Machine Learning · Computer Science 2024-11-25 Jan Spörer , Bernhard Bermeitinger , Tomas Hrycej , Niklas Limacher , Siegfried Handschuh

Approximate Newton Methods

Many machine learning models involve solving optimization problems. Thus, it is important to deal with a large-scale optimization problem in big data applications. Recently, subsampled Newton methods have emerged to attract much attention…

Numerical Analysis · Computer Science 2020-03-24 Haishan Ye , Luo Luo , Zhihua Zhang

A New Perspective for Understanding Generalization Gap of Deep Neural Networks Trained with Large Batch Sizes

Deep neural networks (DNNs) are typically optimized using various forms of mini-batch gradient descent algorithm. A major motivation for mini-batch gradient descent is that with a suitably chosen batch size, available computing resources…

Machine Learning · Computer Science 2022-10-25 Oyebade K. Oyedotun , Konstantinos Papadopoulos , Djamila Aouada

Critical Bach Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One

Practical results have shown that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions. We…

Machine Learning · Computer Science 2022-08-23 Hideaki Iiduka

Data optimization for large batch distributed training of deep neural networks

Distributed training in deep learning (DL) is common practice as data and models grow. The current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale, and…

Machine Learning · Computer Science 2020-12-21 Shubhankar Gahlot , Junqi Yin , Mallikarjun Shankar

DCNNs on a Diet: Sampling Strategies for Reducing the Training Set Size

Large-scale supervised classification algorithms, especially those based on deep convolutional neural networks (DCNNs), require vast amounts of training data to achieve state-of-the-art performance. Decreasing this data requirement would…

Computer Vision and Pattern Recognition · Computer Science 2016-06-15 Maya Kabkab , Azadeh Alavi , Rama Chellappa

Understanding training and generalization in deep learning by Fourier analysis

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs)---equipped with many more parameters than training data and trained by (stochastic) gradient-based methods---often achieve remarkably…

Machine Learning · Computer Science 2018-11-30 Zhiqin John Xu

Scalable Second Order Optimization for Deep Learning

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second…

Machine Learning · Computer Science 2021-03-08 Rohan Anil , Vineet Gupta , Tomer Koren , Kevin Regan , Yoram Singer

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

Deep learning have achieved promising results on a wide spectrum of AI applications. Larger datasets and models consistently yield better performance. However, we generally spend longer training time on more computation and communication.…

Machine Learning · Computer Science 2021-11-03 Xiaoxin He , Fuzhao Xue , Xiaozhe Ren , Yang You

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Synchronous strategies with data parallelism, such as the Synchronous StochasticGradient Descent (S-SGD) and the model averaging methods, are widely utilizedin distributed training of Deep Neural Networks (DNNs), largely owing to itseasy…

Machine Learning · Computer Science 2022-11-04 Qing Ye , Yuhao Zhou , Mingjia Shi , Yanan Sun , Jiancheng Lv

Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning

In this paper, we consider both first- and second-order techniques to address continuous optimization problems arising in machine learning. In the first-order case, we propose a framework of transition from deterministic or…

Machine Learning · Computer Science 2021-11-30 Sanae Lotfi , Tiphaine Bonniot de Ruisselet , Dominique Orban , Andrea Lodi

Efficient Training Under Limited Resources

Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data Augmentation…

Machine Learning · Computer Science 2023-01-24 Mahdi Zolnouri , Dounia Lakhmiri , Christophe Tribes , Eyyüb Sari , Sébastien Le Digabel

Don't Use Large Mini-Batches, Use Local SGD

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have lead to key efficiency and scalability gains in recent years. However,…

Machine Learning · Computer Science 2020-02-18 Tao Lin , Sebastian U. Stich , Kumar Kshitij Patel , Martin Jaggi

Measuring the Effects of Data Parallelism on Neural Network Training

Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch…

Machine Learning · Computer Science 2019-07-22 Christopher J. Shallue , Jaehoon Lee , Joseph Antognini , Jascha Sohl-Dickstein , Roy Frostig , George E. Dahl