Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis; Vincent Cohen-Addad; Monika Henzinger; Sammy Jerome; Vahab Mirrokni; David Saulpic; David Woodruff; Michael Wunder

Machine Learning; Data Structures and Algorithms | 2024-02-28, v1

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
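
As an illustration of the general idea of combining $k$-means clustering with sensitivity sampling, the following is a minimal sketch and not the authors' exact procedure. It assumes a textbook-style sensitivity score (each point's share of the $k$-means cost plus a uniform term within its cluster), a plain Lloyd's iteration for the clustering, and a synthetic stand-in loss; the function names `kmeans` and `sensitivity_sample` are hypothetical. Sampled points are reweighted by the inverse of their sampling probability so that the weighted sum of losses is an unbiased estimate of the total loss.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def sensitivity_sample(X, k, m, seed=0):
    """Draw m indices (with replacement) and their importance weights.

    ASSUMPTION: the score below is a generic k-means sensitivity proxy,
    not necessarily the score used in the paper.
    """
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X, k, seed=seed)
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)  # cost of each point
    cluster_sizes = np.bincount(labels, minlength=k)
    score = d2 / max(d2.sum(), 1e-12) + 1.0 / cluster_sizes[labels]
    p = score / score.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])  # inverse-probability weights
    return idx, weights


# Usage on synthetic embeddings with a stand-in loss (both are assumptions).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8))
loss = (X ** 2).sum(axis=1)
idx, w = sensitivity_sample(X, k=10, m=300)
estimate = (w * loss[idx]).sum() / len(X)  # estimated average loss
true_mean = loss.mean()
```

Comparing `estimate` with `true_mean` gives a quick sanity check of the reweighting. In practice the embeddings would come from a pretrained model and the clustering from a scalable $k$-means implementation.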

Keywords

cluster analysis, randomized algorithm, feature selection

Cite

@article{arxiv.2402.17327,
  title  = {Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond},
  author = {Kyriakos Axiotis and Vincent Cohen-Addad and Monika Henzinger and Sammy Jerome and Vahab Mirrokni and David Saulpic and David Woodruff and Michael Wunder},
  journal= {arXiv preprint arXiv:2402.17327},
  year   = {2024}
}
