Online AUC Optimization for Sparse High-Dimensional Datasets

Machine Learning 2020-09-24 v1 Artificial Intelligence Machine Learning

Abstract

The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such cases, each $d$-dimensional sample has only $k$ non-zero features with $k \ll d$, and data arrives sequentially in streaming form. Current online AUC optimization algorithms have a high per-iteration cost of $\mathcal{O}(d)$ and usually produce non-sparse solutions, and hence are not suitable for this setting. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. Our proposed algorithm processes data in an online fashion with a much cheaper per-iteration cost of $\mathcal{O}(k)$, making it amenable to high-dimensional sparse streaming data analysis. The algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle point problem, and on the introduction of a "lazy update" rule that reduces the per-iteration complexity from $\mathcal{O}(d)$ to $\mathcal{O}(k)$. Furthermore, FTRL-AUC can inherently capture sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. Comparison with an online learning method for logistic loss demonstrates that FTRL-AUC achieves higher AUC scores, especially when datasets are imbalanced.
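As a rough illustration of two ideas the abstract relies on, the sketch below computes the empirical AUC as a pairwise (U-statistic) quantity and scores a sparse sample in time proportional to its number of non-zero features. It is not the FTRL-AUC algorithm itself. The function names `sparse_score` and `empirical_auc`, the dict-based sparse representation, and the convention that ties count as one half are assumptions made for this example.

```python
def sparse_score(weights, sample):
    """Linear score w^T x for a sparse sample.

    Both arguments are dicts mapping feature index -> value (an assumed
    representation). Only the k non-zero features of `sample` are visited,
    so the cost is O(k) rather than O(d).
    """
    return sum(weights.get(j, 0.0) * v for j, v in sample.items())


def empirical_auc(scores, labels):
    """Empirical AUC as a U-statistic over (positive, negative) pairs.

    Returns the fraction of pairs in which the positive sample receives
    the higher score; ties contribute 0.5 (an assumed convention).
    Labels equal to 1 are treated as positive, all others as negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    w = {0: 1.0, 5: 2.0}   # hypothetical sparse weight vector
    x = {5: 3.0, 7: 1.0}   # hypothetical sample with k = 2 non-zero features
    print(sparse_score(w, x))
    print(empirical_auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1]))
```

The pairwise loop costs time proportional to the number of positive-negative pairs, which is why a direct evaluation of this objective is awkward in a streaming setting.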

Cite

@article{arxiv.2009.10867,
  title  = {Online AUC Optimization for Sparse High-Dimensional Datasets},
  author = {Baojian Zhou and Yiming Ying and Steven Skiena},
  journal= {arXiv preprint arXiv:2009.10867},
  year   = {2020}
}

Comments

20th IEEE International Conference on Data Mining