Subset Sampling and Its Extensions
Abstract
This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\mathbf{p}$ that assigns each record $v \in \mathcal{S}$ a probability $\mathbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v \in \mathcal{S}$ is sampled into $X$ independently with probability $\mathbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\mathbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)+(\mu_{\mathcal{S}}/B)\log_{M/B}(n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_B(\cdot)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time.
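To make the problem statement concrete, the following is a minimal Python sketch of the two query types as the abstract defines them. It is only a naive linear-scan baseline, not the data structure developed in the paper: every query touches all records, whereas the paper's structures aim for query cost tied to the expected output size (the sum of the record probabilities). The function names `naive_subset_sample`, `naive_range_subset_sample` and `expected_size` are hypothetical and chosen for illustration.

```python
import random


def naive_subset_sample(probs, rng=random):
    """Baseline subset sampling (illustrative, not the paper's method).

    probs[i] is the sampling probability of record i. Each record is
    included independently with its own probability. Runs in O(n) time
    per query regardless of how many records are returned.
    """
    return [i for i, p in enumerate(probs) if rng.random() < p]


def naive_range_subset_sample(keys, probs, a, b, rng=random):
    """Baseline range subset sampling (illustrative).

    Only records whose key lies in the closed range [a, b] are eligible;
    each eligible record is included independently with its probability.
    """
    return [
        i
        for i, (k, p) in enumerate(zip(keys, probs))
        if a <= k <= b and rng.random() < p
    ]


def expected_size(probs):
    """Expected number of sampled records: the sum of all probabilities."""
    return sum(probs)
```

A record with probability 0 is never returned and a record with probability 1 always is, because `rng.random()` draws from the half-open interval [0, 1). Passing a seeded `random.Random` instance as `rng` makes a run reproducible.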
Cite
@article{arxiv.2307.11585,
title = {Subset Sampling and Its Extensions},
author = {Jinchao Huang and Sibo Wang},
journal= {arXiv preprint arXiv:2307.11585},
year = {2023}
}
Comments
17 pages