A Framework for Estimating Stream Expression Cardinalities

Data Structures and Algorithms 2016-02-25 v2

Abstract

Given $m$ distributed data streams $A_1, \dots, A_m$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_1, \dots, A_m$. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
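To make the problem statement concrete, the following is a minimal sketch of one well-known style of estimator for this setting: a k-minimum-values (KMV) sample per stream, combined across streams with a shared threshold. It is an illustration under stated assumptions, not necessarily the construction analyzed in the paper. The names `h`, `kmv_sketch`, `union`, `intersection`, and `estimate` are hypothetical, the hash is SHA-256 truncated to 64 bits, and the sketch is built by sorting the whole input rather than by a true one-pass update.

```python
import hashlib


def h(item):
    """Map an identifier to a pseudo-uniform value in [0, 1) (assumed hash choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(stream, k):
    """Return (theta, sample) for one stream.

    theta is the k-th smallest distinct hash value, or 1.0 if the stream has
    fewer than k distinct identifiers. sample holds every hash strictly below theta.
    For brevity this sorts all hashes; a streaming version would keep a bounded heap.
    """
    hashes = sorted({h(x) for x in stream})
    if len(hashes) < k:
        return 1.0, set(hashes)
    theta = hashes[k - 1]
    return theta, set(hashes[: k - 1])


def union(*sketches):
    """Sketch of A_1 ∪ ... ∪ A_m: smallest threshold, merged samples below it."""
    theta = min(t for t, _ in sketches)
    merged = set().union(*(s for _, s in sketches))
    return theta, {v for v in merged if v < theta}


def intersection(*sketches):
    """Sketch of A_1 ∩ ... ∩ A_m: smallest threshold, common samples below it."""
    theta = min(t for t, _ in sketches)
    common = set.intersection(*(s for _, s in sketches))
    return theta, {v for v in common if v < theta}


def estimate(sketch):
    """Estimated number of unique identifiers: sample size divided by threshold."""
    theta, sample = sketch
    return len(sample) / theta


if __name__ == "__main__":
    a = kmv_sketch(range(0, 20000), k=1024)       # stream A_1
    b = kmv_sketch(range(10000, 30000), k=1024)   # stream A_2
    print("estimated |A_1 ∪ A_2|:", round(estimate(union(a, b))))
    print("estimated |A_1 ∩ A_2|:", round(estimate(intersection(a, b))))
```

In this example the true cardinalities are 30000 for the union and 10000 for the intersection. The intersection estimate is typically noisier, since fewer sampled hashes fall in the intersection than in the union.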

Cite

@article{arxiv.1510.01455,
  title  = {A Framework for Estimating Stream Expression Cardinalities},
  author = {Anirban Dasgupta and Kevin Lang and Lee Rhodes and Justin Thaler},
  journal= {arXiv preprint arXiv:1510.01455},
  year   = {2016}
}