Consistent Subset Sampling

Data Structures and Algorithms 2014-04-21 v1

Abstract

Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $\mathcal{I}\subseteq U$ it should be possible to quickly compute $\mathcal{I}\cap S$, i.e., the elements in $\mathcal{I}$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^k)$. Using a carefully designed hash function, for a given sampling probability $p \in (0,1]$, we show how to improve the time complexity to $\Theta(b^{\lceil k/2\rceil}\log \log b + pb^k)$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $\Theta(b^{\lceil k/4\rceil})$. We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent $k$-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
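
For orientation, the baseline mentioned in the abstract (standard consistent sampling applied to every $k$-subset of a set, at $\Theta(b^k)$ cost per set) can be sketched as follows. This is only an illustrative sketch, not the improved method of the paper: the function names `unit_hash` and `sample_k_subsets`, the `seed` parameter, and the use of SHA-256 as the hash function are assumptions made for this example, and items are assumed to be sortable so that each subset has one canonical representation.

```python
import hashlib
from itertools import combinations


def unit_hash(items, seed=0):
    """Map a tuple of items to a pseudo-random value in [0, 1).

    SHA-256 is an assumption of this sketch; any hash function that
    behaves sufficiently randomly could be substituted.
    """
    key = repr((seed, tuple(items))).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sample_k_subsets(itemset, k, p, seed=0):
    """Return the k-subsets of `itemset` whose hash value is below p.

    Every k-subset is enumerated, so the running time is Theta(b^k)
    for a set of size b. The decision depends only on the subset
    itself, which is what makes the sampling consistent across sets.
    """
    ordered = sorted(set(itemset))  # canonical order for each subset
    return {c for c in combinations(ordered, k) if unit_hash(c, seed) < p}


if __name__ == "__main__":
    t1 = sample_k_subsets(["a", "b", "c", "d"], k=2, p=0.5)
    t2 = sample_k_subsets(["b", "c", "d", "e"], k=2, p=0.5)
    # A 2-subset contained in both sets is sampled in both or in neither.
    shared = set(combinations(["b", "c", "d"], 2))
    print(t1 & shared == t2 & shared)
```

Because the sampling decision is a function of the subset alone, two sets processed independently (for example, on different machines) agree on every $k$-subset they share, which is the property the abstract relies on.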

Cite

@article{arxiv.1404.4693,
  title  = {Consistent Subset Sampling},
  author = {Konstantin Kutzkov and Rasmus Pagh},
  journal= {arXiv preprint arXiv:1404.4693},
  year   = {2014}
}

Comments

To appear in SWAT 2014
