Testing with Non-identically Distributed Samples

Data Structures and Algorithms 2025-11-05 v2 Information Theory Machine Learning math.IT Machine Learning

Abstract

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2, \ldots, p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
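
The sampling model in the abstract can be made concrete with a small simulation. The sketch below is an illustration only and is not the paper's tester or estimator: all function names (`draw_samples`, `average_distribution`, `empirical_average`, `l1_distance`, `pooled_collisions`, `run_demo`) and the parameter values are hypothetical choices, and the pooled empirical distribution is assumed here as the natural estimator of $p_{avg}$. It draws $c$ samples from each of $T$ distributions, estimates $p_{avg}$ with $c=1$, and then compares pooled collision counts for a set of point-mass distributions (whose average is exactly uniform) against i.i.d. uniform draws.

```python
import random
from collections import Counter


def draw_samples(dists, c, rng):
    """Draw c independent samples from each distribution p_1, ..., p_T.

    Returns a list of T groups, each a list of c support elements in 0..k-1.
    """
    k = len(dists[0])
    return [rng.choices(range(k), weights=p, k=c) for p in dists]


def average_distribution(dists):
    """Compute p_avg, the coordinate-wise mean of the T distributions."""
    T, k = len(dists), len(dists[0])
    return [sum(p[j] for p in dists) / T for j in range(k)]


def empirical_average(samples, k):
    """Pooled empirical distribution over all T*c samples (assumed estimator)."""
    counts = Counter(x for group in samples for x in group)
    n = sum(counts.values())
    return [counts[j] / n for j in range(k)]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def pooled_collisions(samples):
    """Number of colliding pairs in the pooled multiset of samples."""
    counts = Counter(x for group in samples for x in group)
    return sum(m * (m - 1) // 2 for m in counts.values())


def run_demo(seed=0):
    rng = random.Random(seed)

    # Learning with c = 1: T heterogeneous distributions on k elements.
    k, T = 20, 5000
    dists = []
    for _ in range(T):
        w = [rng.random() for _ in range(k)]
        total = sum(w)
        dists.append([x / total for x in w])
    samples = draw_samples(dists, c=1, rng=rng)
    err = l1_distance(empirical_average(samples, k), average_distribution(dists))

    # Collisions with c = 1: point masses p_i = delta_i average to uniform,
    # yet every element appears exactly once, so there are no collisions.
    k2 = 200
    point_masses = [[1.0 if j == i else 0.0 for j in range(k2)] for i in range(k2)]
    hetero = pooled_collisions(draw_samples(point_masses, c=1, rng=rng))

    # The same number of i.i.d. uniform draws typically produces many collisions.
    uniform = [[1.0 / k2] * k2 for _ in range(k2)]
    iid = pooled_collisions(draw_samples(uniform, c=1, rng=rng))
    return err, hetero, iid


if __name__ == "__main__":
    err, hetero, iid = run_demo()
    print(f"l1 error of pooled estimate (c=1): {err:.3f}")
    print(f"pooled collisions, point masses:   {hetero}")
    print(f"pooled collisions, i.i.d. uniform: {iid}")
```

In the point-mass case $p_{avg}$ is exactly uniform, but the pooled collision count is zero, whereas i.i.d. uniform draws of the same size collide often. This gives one intuition (not the paper's argument) for why pooled collision statistics cannot be read as in the i.i.d. setting when $c=1$, even though the pooled empirical distribution still estimates $p_{avg}$ well.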

Cite

@article{arxiv.2311.11194,
  title  = {Testing with Non-identically Distributed Samples},
  author = {Shivam Garg and Chirag Pabbaraju and Kirankumar Shiragur and Gregory Valiant},
  journal= {arXiv preprint arXiv:2311.11194},
  year   = {2025}
}