Testing with Non-identically Distributed Samples
Abstract
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: suppose there is a set of distributions over a discrete support of size $k$, $\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_T$, and we obtain $c$ independent draws from each distribution. The goal is to learn or test a property of the average distribution, $\mathbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings in which the individual distributions correspond to heterogeneous entities, such as individuals, chronologically distinct time periods, or spatially separated data sources. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\mathbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in $\ell_1$ distance. For testing uniformity or identity, that is, distinguishing the case that $\mathbf{p}_{\mathrm{avg}}$ equals some reference distribution from the case that it has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a number of samples linear in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear-sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity of the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case there is a constant $\rho > 0$ such that, even in the linear regime with $\rho k$ samples, no tester that considers only the multiset of samples (ignoring which samples were drawn from the same $\mathbf{p}_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
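To make the sampling model concrete, the following is a minimal simulation sketch, not the paper's algorithm. It assumes the symbols k (support size), T (number of distributions), and c (draws per distribution) as illustrative names, generates hypothetical random distributions, and estimates the average distribution with the pooled empirical histogram, a standard estimator chosen here for illustration. All function names are made up for this example.

```python
import random
from collections import Counter


def random_distribution(k, rng):
    """Hypothetical helper: a random probability vector over {0, ..., k-1}."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def average_distribution(dists):
    """Coordinate-wise mean of the given distributions (the average distribution)."""
    k = len(dists[0])
    return [sum(p[j] for p in dists) / len(dists) for j in range(k)]


def draw_samples(dists, c, rng):
    """Draw c independent samples from each distribution, grouped by source."""
    k = len(dists[0])
    return [rng.choices(range(k), weights=p, k=c) for p in dists]


def empirical_average(samples, k):
    """Pool every sample and return the normalized histogram.

    Each sample is an unbiased draw from its own distribution, so the pooled
    histogram is an unbiased estimate of the average distribution.
    """
    counts = Counter(x for group in samples for x in group)
    total = sum(len(group) for group in samples)
    return [counts[j] / total for j in range(k)]


def l1_distance(p, q):
    """l1 distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Assumed toy parameters: one draw (c = 1) from each of T heterogeneous sources.
rng = random.Random(0)
k, T, c = 10, 20000, 1
dists = [random_distribution(k, rng) for _ in range(T)]
p_avg = average_distribution(dists)
samples = draw_samples(dists, c, rng)
p_hat = empirical_average(samples, k)
error = l1_distance(p_avg, p_hat)
print(f"l1 error of the pooled estimate: {error:.4f}")
```

The sketch covers only the learning side of the abstract, where a single draw per distribution is enough. The testing results depend on having at least two draws from the same source, which this pooled estimator deliberately ignores.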
Cite
@article{arxiv.2311.11194,
title = {Testing with Non-identically Distributed Samples},
author = {Shivam Garg and Chirag Pabbaraju and Kirankumar Shiragur and Gregory Valiant},
journal = {arXiv preprint arXiv:2311.11194},
year = {2025}
}