Data Selection for ERMs

Machine Learning 2025-04-29 v2

Abstract

Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.

Cite

@article{arxiv.2504.14572,
  title  = {Data Selection for ERMs},
  author = {Steve Hanneke and Shay Moran and Alexander Shlimovich and Amir Yehudayoff},
  journal= {arXiv preprint arXiv:2504.14572},
  year   = {2025}
}