Finding High-Value Training Data Subset through Differentiable Convex Programming

Machine Learning 2021-04-29 v1

Abstract

Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques for calculating the "value" of individual training datapoints have been proposed for explaining trained models. However, the value of a training datapoint also depends on which other training datapoints are selected, a dependence that existing methods do not capture explicitly. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is to design a learnable framework for online subset selection that can be trained on mini-batches of training data, which makes our method scalable. This results in a parameterized convex subset selection problem that is amenable to a differentiable convex programming paradigm, allowing us to learn the parameters of the selection model through end-to-end training. Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm consistently finds higher-value subsets of training data than recent state-of-the-art methods, in some cases about 20% higher in value. The subsets are also useful for finding mislabelled training data. The running time of our algorithm is comparable to that of existing valuation functions.
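
To make the phrase "convex subset selection" concrete, the following is a minimal sketch of one standard convex relaxation of choosing k out of n items. It is not the formulation used in the paper, which the abstract does not spell out. The function name `relaxed_top_k` and the use of a capped-simplex projection are assumptions made purely for illustration. The projection is a piecewise-linear function of the scores, so it is differentiable almost everywhere, which is the property a differentiable convex programming layer relies on.

```python
import numpy as np

def relaxed_top_k(scores, k, iters=100):
    """Hypothetical helper (not from the paper).

    Solves the convex problem
        min_w ||w - scores||^2   s.t.  0 <= w <= 1,  sum(w) = k
    i.e. the Euclidean projection of `scores` onto the capped simplex.
    The solution has the form w = clip(scores - tau, 0, 1) for a scalar
    tau, which is found here by bisection.
    """
    scores = np.asarray(scores, dtype=float)
    # At lo every weight is clipped to 1 (sum = n); at hi every weight is 0.
    lo, hi = scores.min() - 1.0, scores.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        # The sum is non-increasing in tau: too large a sum means tau is too small.
        if np.clip(scores - tau, 0.0, 1.0).sum() > k:
            lo = tau
        else:
            hi = tau
    return np.clip(scores - 0.5 * (lo + hi), 0.0, 1.0)

# Example: soft selection weights for 5 datapoints with a budget of k = 2.
# In a learnable setting the scores would come from a parameterized model.
example_scores = [0.1, 2.0, 0.3, 1.5, -0.4]
weights = relaxed_top_k(example_scores, k=2)
```

In a learnable selector of the kind the abstract describes, the scores would be produced by a parameterized model and the weights would reweight the training loss of the ML model, with the two sets of parameters updated in alternation. That training loop is omitted here because its details are not given in the abstract.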

Cite

@article{arxiv.2104.13794,
  title  = {Finding High-Value Training Data Subset through Differentiable Convex Programming},
  author = {Soumi Das and Arshdeep Singh and Saptarshi Chatterjee and Suparna Bhattacharya and Sourangshu Bhattacharya},
  journal= {arXiv preprint arXiv:2104.13794},
  year   = {2021}
}