Sets Clustering
Abstract
The input to the \emph{sets-$k$-means} problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \cdots, P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \|p - c\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0,1)$. The result easily generalizes to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this core-set allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first such result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility location are also provided.
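To make the objective concrete, the following minimal sketch evaluates the cost of a candidate set of centers: for each input set, take the smallest squared Euclidean distance between any of its points and any center, then sum over all input sets. The function names `squared_distance` and `sets_k_means_cost`, the optional `weights` argument, and the toy data are illustrative assumptions, not the authors' released code. The sketch only evaluates the cost; it does not construct a core-set or solve the minimization.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_k_means_cost(sets, centers, weights=None):
    """Weighted sum, over the input sets, of the smallest squared distance
    between any point of the set and any center.

    `weights` is a hypothetical parameter: with all weights equal to 1
    (the default) this is the plain objective; other weights give the
    kind of weighted sum that a weighted subset of the input would use.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(squared_distance(p, c) for p in P for c in centers)
        for w, P in zip(weights, sets)
    )


# Toy input (assumed for illustration): three sets of points in the plane.
sets = [
    [(0.0, 0.0), (10.0, 10.0)],  # contains a point equal to a center
    [(1.0, 0.0)],                # closest center is (0, 0)
    [(5.0, 5.0), (9.0, 10.0)],   # closest pair is (9, 10) and (10, 10)
]
centers = [(0.0, 0.0), (10.0, 10.0)]

cost = sets_k_means_cost(sets, centers)
weighted_cost = sets_k_means_cost(sets, centers, weights=[2.0, 3.0, 0.0])
```

Each input set contributes only through its single closest point-center pair, which is what distinguishes this objective from ordinary $k$-means on the union of all points.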
Cite
@article{arxiv.2003.04135,
title = {Sets Clustering},
author = {Ibrahim Jubran and Murad Tukan and Alaa Maalouf and Dan Feldman},
journal = {arXiv preprint arXiv:2003.04135},
year = {2020}
}