Sets Clustering

Machine Learning 2020-03-10 v1

Abstract

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalizes to any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.
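
To make the objective concrete, the following is a minimal Python sketch of the cost that the abstract defines: each input set contributes the squared distance between its closest point and its closest center. The names `sq_dist` and `sets_cost` and the optional `weights` argument are assumptions made for illustration; they are not taken from the authors' open source code, and the sketch only evaluates the cost. It does not construct a core-set.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over input sets of the smallest squared distance between
    any point of the set and any center.

    `weights` is a hypothetical extension: with one weight per set, the
    same function evaluates the cost of a weighted subset of the input.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(sq_dist(p, c) for p in P for c in centers)
        for w, P in zip(weights, sets)
    )


# Three sets in the plane (d = 2) and k = 2 candidate centers.
example_sets = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(1.0, 0.0)],
    [(5.0, 5.0), (6.0, 5.0)],
]
example_centers = [(0.0, 0.0), (5.0, 5.0)]
print(sets_cost(example_sets, example_centers))
```

In this example the first and third sets each contain a point that coincides with a center, so only the second set contributes to the total. Evaluating the cost this way takes time proportional to the total number of input points times $k$ for a single candidate $C$, which is why a small weighted subset that preserves the cost for every $C$ is useful.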

Cite

@article{arxiv.2003.04135,
  title  = {Sets Clustering},
  author = {Ibrahim Jubran and Murad Tukan and Alaa Maalouf and Dan Feldman},
  journal= {arXiv preprint arXiv:2003.04135},
  year   = {2020}
}