Testing with Non-identically Distributed Samples

Shivam Garg; Chirag Pabbaraju; Kirankumar Shiragur; Gregory Valiant

Data Structures and Algorithms 2025-11-05 v2 Information Theory Machine Learning math.IT Machine Learning

Abstract

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
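
The sampling model in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or their tester; it only shows the learning setup with $c=1$, where pooling one draw from each $p_i$ gives an unbiased empirical estimate of $p_{avg}$. All function names and the parameter values ($k=10$, $T=5000$) are assumptions chosen for illustration.

```python
import random


def random_distribution(k, rng):
    """Hypothetical helper: an arbitrary distribution over {0, ..., k-1}."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_samples(dists, c, rng):
    """Draw c independent samples from each of the distributions p_1..p_T."""
    k = len(dists[0])
    return [rng.choices(range(k), weights=p, k=c) for p in dists]


def average_distribution(dists):
    """The target p_avg: the coordinate-wise mean of p_1..p_T."""
    k, T = len(dists[0]), len(dists)
    return [sum(p[j] for p in dists) / T for j in range(k)]


def empirical_average(samples, k):
    """Pool all draws (ignoring their source) and return empirical frequencies."""
    counts = [0] * k
    total = 0
    for group in samples:
        for x in group:
            counts[x] += 1
            total += 1
    return [n / total for n in counts]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


# Illustrative parameters (assumptions, not values from the paper).
rng = random.Random(0)
k, T, c = 10, 5000, 1
dists = [random_distribution(k, rng) for _ in range(T)]
samples = draw_samples(dists, c, rng)

p_avg = average_distribution(dists)
p_hat = empirical_average(samples, k)
err = l1_distance(p_avg, p_hat)
print(f"l1 error of pooled estimate: {err:.4f}")
```

Note that `empirical_average` discards which draws came from the same $p_i$. That is harmless for learning $p_{avg}$, but the abstract's $c=2$ lower bound says a uniformity tester that discards this grouping cannot succeed even with $\rho k$ samples, so a tester for $c \ge 2$ would need to keep the per-source structure that `draw_samples` returns.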

Keywords

group testing, randomized algorithm, probability theory

Cite

@article{arxiv.2311.11194,
  title  = {Testing with Non-identically Distributed Samples},
  author = {Shivam Garg and Chirag Pabbaraju and Kirankumar Shiragur and Gregory Valiant},
  journal= {arXiv preprint arXiv:2311.11194},
  year   = {2025}
}
