Dropout Training as Adaptive Regularization

Machine Learning 2013-11-04 v2 Machine Learning Methodology

Abstract

Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
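As a rough illustration of the adaptive-regularization claim, the sketch below computes a second-order (quadratic) approximation to the dropout penalty for logistic regression. It assumes unbiased dropout noise, where each feature is zeroed with probability `delta` and the survivors are rescaled by `1 / (1 - delta)`, so that the variance of each noised term `x_ij * beta_j` is `delta / (1 - delta) * x_ij**2 * beta_j**2`. The function name, variable names, and the NumPy setting are illustrative assumptions, not the authors' code.

```python
import numpy as np


def dropout_penalty_quadratic(X, beta, delta):
    """Quadratic approximation to the dropout penalty for logistic regression.

    Assumes unbiased dropout: each feature is dropped with probability
    `delta` and kept features are rescaled by 1 / (1 - delta).

    X     : (n, d) design matrix
    beta  : (d,) coefficient vector
    delta : dropout probability, 0 <= delta < 1
    """
    z = X @ beta
    p = 1.0 / (1.0 + np.exp(-z))          # predicted probabilities
    curvature = p * (1.0 - p)             # second derivative of the log-partition
    # Diagonal of X^T diag(curvature) X: a per-feature Fisher information estimate.
    fisher_diag = (X ** 2).T @ curvature
    # An L2 penalty on beta with each coordinate weighted by its Fisher estimate.
    return 0.5 * delta / (1.0 - delta) * np.sum(fisher_diag * beta ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    beta = rng.normal(size=5)
    print(dropout_penalty_quadratic(X, beta, delta=0.5))
```

Under these assumptions the penalty is a weighted L2 norm of `beta`: features that are rarely active, or that appear mostly on confidently classified examples where `p * (1 - p)` is small, contribute little to `fisher_diag` and are therefore penalized less. This is the sense in which the regularizer adapts to the data.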

Cite

@article{arxiv.1307.1493,
  title  = {Dropout Training as Adaptive Regularization},
  author = {Stefan Wager and Sida Wang and Percy Liang},
  journal= {arXiv preprint arXiv:1307.1493},
  year   = {2013}
}

Comments

11 pages. Advances in Neural Information Processing Systems (NIPS), 2013
