Subset Sampling and Its Extensions

Data Structures and Algorithms 2023-07-24 v1 Databases

Abstract

This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(.)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time.
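To make the problem statement concrete, the following Python sketch shows a naive baseline for subset sampling and range subset sampling, plus a standard geometric-skip trick for the special case where every record has the same probability. It is an illustration only and is not the data structure developed in the paper. All function and parameter names (`naive_subset_sample`, `naive_range_subset_sample`, `equal_prob_subset_sample`, `key`, `rng`) are hypothetical. The naive queries cost $\mathcal{O}(n)$ time regardless of $\mu_{\mathcal{S}}$. The equal-probability variant runs in $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected time, which hints at why a bound of that form is plausible.

```python
import math
import random


def naive_subset_sample(records, p, rng=random):
    """Naive O(n) query: keep each record v independently with probability p[v]."""
    return [v for v in records if rng.random() < p[v]]


def naive_range_subset_sample(records, p, key, a, b, rng=random):
    """Naive range variant: only records whose key lies in [a, b] are eligible."""
    return [v for v in records
            if a <= key[v] <= b and rng.random() < p[v]]


def expected_size(records, p):
    """mu_S: the sum of p(v) over all records, i.e. the expected output size."""
    return sum(p[v] for v in records)


def equal_prob_subset_sample(n, p, rng=random):
    """Sample indices from range(n), each independently with the same probability p.

    Assumption: all records share one probability. Instead of flipping n coins,
    jump directly to the next sampled index using a geometric skip length, so the
    expected running time is O(1 + n * p).
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log1p(-p)           # log(1 - p), strictly negative here
    sampled = []
    i = -1
    while True:
        u = 1.0 - rng.random()       # uniform in (0, 1], so log(u) is defined
        skip = int(math.log(u) / log_q)  # number of unsampled records to skip
        i += skip + 1
        if i >= n:
            return sampled
        sampled.append(i)
```

A query with $\textbf{p}(v)=1$ for every record returns all of $\mathcal{S}$, and one with $\textbf{p}(v)=0$ everywhere returns the empty set. For general, non-uniform probabilities the skip trick no longer applies directly, which is the case the paper's data structures address.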

Cite

@article{arxiv.2307.11585,
  title  = {Subset Sampling and Its Extensions},
  author = {Jinchao Huang and Sibo Wang},
  journal= {arXiv preprint arXiv:2307.11585},
  year   = {2023}
}

Comments

17 pages