Related papers: Large Batch Training Does Not Need Warmup

Revisiting LARS for Large Batch Training Generalization of Neural Networks

This paper explores Large Batch Training techniques using layer-wise adaptive scaling ratio (LARS) across diverse settings, uncovering insights. LARS algorithms with warm-up tend to be trapped in sharp minimizers early on due to redundant…

Machine Learning · Computer Science 2024-08-28 Khoi Do, Duong Nguyen, Hoa Nguyen, Long Tran-Thanh, Nguyen-Hoang Tran, Quoc-Viet Pham

Large Batch Training of Convolutional Networks

A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with mini-batch divided between computational…

Computer Vision and Pattern Recognition · Computer Science 2017-09-15 Yang You, Igor Gitman, Boris Ginsburg

Automated Learning Rate Scheduler for Large-batch Training

Large-batch training has been essential in leveraging large-scale datasets and models in deep learning. While it is computationally beneficial to use large batch sizes, it often requires a specially designed learning rate (LR) schedule to…

Machine Learning · Computer Science 2021-07-14 Chiheon Kim, Saehoon Kim, Jongmin Kim, Donghoon Lee, Sungwoong Kim

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in…

Machine Learning · Computer Science 2020-01-06 Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

Evaluating Deep Learning in SystemML using Layer-wise Adaptive Rate Scaling(LARS) Optimizer

Increasing the batch size of a deep learning model is a challenging task. Although it might help in utilizing full available system memory during training phase of a model, it results in significant loss of test accuracy most often. LARS…

Machine Learning · Computer Science 2021-02-08 Kanchan Chowdhury, Ankita Sharma, Arun Deepak Chandrasekar

Study on the Large Batch Size Training of Neural Networks Based on the Second Order Gradient

Large batch size training in deep neural networks (DNNs) possesses a well-known 'generalization gap' that remarkably induces generalization performance degradation. However, it remains unclear how varying batch size affects the structure of…

Machine Learning · Computer Science 2020-12-17 Fengli Gao, Huicai Zhong

Large-Batch Training for LSTM and Beyond

Large-batch training approaches have enabled researchers to utilize large-scale distributed processing and greatly accelerate deep-neural net (DNN) training. For example, by scaling the batch size from 256 to 32K, researchers have been able…

Machine Learning · Computer Science 2019-01-25 Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

Improving Layer-wise Adaptive Rate Methods using Trust Ratio Clipping

Training neural networks with large batch is of fundamental significance to deep learning. Large batch training remarkably reduces the amount of training time but has difficulties in maintaining accuracy. Recent works have put forward…

Machine Learning · Computer Science 2020-11-30 Jeffrey Fong, Siwei Chen, Kaiqi Chen

Class Adaptive Network Calibration

Recent studies have revealed that, beyond conventional accuracy, calibration should also be considered for training modern deep neural networks. To address miscalibration during learning, some methods have explored different penalty…

Computer Vision and Pattern Recognition · Computer Science 2023-04-13 Bingyuan Liu, Jérôme Rony, Adrian Galdran, Jose Dolz, Ismail Ben Ayed

Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence

Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite the huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the…

Machine Learning · Computer Science 2025-09-10 Yuxing Liu, Yuze Ge, Rui Pan, An Kang, Tong Zhang

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed…

Machine Learning · Computer Science 2020-07-13 Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer…

Machine Learning · Computer Science 2018-02-15 Aditya Devarakonda, Maxim Naumov, Michael Garland

Revisiting Small Batch Training for Deep Neural Networks

Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide…

Machine Learning · Computer Science 2018-04-23 Dominic Masters, Carlo Luschi

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been…

Machine Learning · Statistics 2018-01-03 Elad Hoffer, Itay Hubara, Daniel Soudry

DIVEBATCH: Accelerating Model Training Through Gradient-Diversity Aware Batch Size Adaptation

The goal of this paper is to accelerate the training of machine learning models, a critical challenge since the training of large-scale deep neural models can be computationally expensive. Stochastic gradient descent (SGD) and its variants…

Machine Learning · Computer Science 2025-09-22 Yuen Chen, Yian Wang, Hari Sundaram

Augment your batch: better training with larger batches

Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances…

Machine Learning · Computer Science 2019-01-29 Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry

Large-Scale Training System for 100-Million Classification at Alibaba

In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging…

Machine Learning · Computer Science 2021-02-12 Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, Rong Jin

Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)

In modern deep learning models, long training times and large datasets present significant challenges to both efficiency and scalability. Effective data curation and sample selection are crucial for optimizing the training process of deep…

Machine Learning · Computer Science 2024-12-24 Mohammadreza Sharifi

Training Deep Neural Networks Without Batch Normalization

Training neural networks is an optimization problem, and finding a decent set of parameters through gradient descent can be a difficult task. A host of techniques has been developed to aid this process before and during the training phase.…

Machine Learning · Computer Science 2020-08-19 Divya Gaur, Joachim Folz, Andreas Dengel

Why Do We Need Warm-up? A Theoretical Perspective

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled…

Machine Learning · Computer Science 2025-10-06 Foivos Alimisis, Rustem Islamov, Aurelien Lucchi