Testing Data Binnings

Data Structures and Algorithms, Discrete Mathematics. 2020-04-28 (v1)

Abstract

Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) $\mathbf{q}$ and samples from an unknown distribution $\mathbf{p}$, both over $[n]=\{1,2,\dots,n\}$, whether $\mathbf{p}$ equals $\mathbf{q}$, or is significantly different from it. In this paper, we introduce the related question of 'identity up to binning,' where the reference distribution $\mathbf{q}$ is over $k \ll n$ elements: the question is then whether there exists a suitable binning of the domain $[n]$ into $k$ intervals such that, once "binned," $\mathbf{p}$ is equal to $\mathbf{q}$. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne (2019). Finally, we discuss several extensions and related research directions.
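To make the "identity up to binning" property concrete, the following is a minimal sketch that checks it in the idealized setting where the distribution p is fully known, rather than accessed through samples. It is not the testing algorithm from the paper, which works from samples and tolerates statistical distance; it only illustrates the definition. The function name `find_binning`, the use of exact `Fraction` arithmetic, and the assumption that every entry of q is strictly positive are choices made for this illustration.

```python
from fractions import Fraction


def find_binning(p, q):
    """Look for a partition of the domain of p into len(q) consecutive
    intervals whose masses under p equal q exactly.

    p, q: lists of Fraction summing to 1 (q assumed strictly positive).
    Returns the exclusive right endpoints of the intervals, or None if
    no such binning exists.
    """
    cuts = []
    i = 0  # index of the next element of p not yet assigned to an interval
    n = len(p)
    for j, target in enumerate(q):
        mass = Fraction(0)
        is_last = j == len(q) - 1
        # Grow the current interval until it reaches the target mass.
        # The last interval must absorb every remaining element.
        while i < n and (is_last or mass < target):
            mass += p[i]
            i += 1
        # Interval masses only grow as elements are added, so overshooting
        # or running out of elements means no valid binning exists.
        if mass != target:
            return None
        cuts.append(i)
    return cuts


F = Fraction
p = [F(1, 8), F(1, 8), F(1, 4), F(1, 4), F(1, 4)]

# Intervals {0,1}, {2,3}, {4} have masses 1/4, 1/2, 1/4.
print(find_binning(p, [F(1, 4), F(1, 2), F(1, 4)]))

# No prefix of the uniform distribution on 3 elements has mass 1/2.
print(find_binning([F(1, 3)] * 3, [F(1, 2), F(1, 2)]))
```

The greedy scan suffices here because the intervals are consecutive: each cut point is forced by the cumulative mass of q, so there is nothing to search over once p is known exactly. The sample-based question studied in the paper is harder precisely because p is not known.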

Cite

@article{arxiv.2004.12893,
  title  = {Testing Data Binnings},
  author = {Clément L. Canonne and Karl Wimmer},
  journal= {arXiv preprint arXiv:2004.12893},
  year   = {2020}
}