Active Sequential Two-Sample Testing

Machine Learning 2024-07-01 v4 Methodology

Abstract

A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address this problem, we devise the first \emph{active sequential two-sample testing framework}, which queries labels not only sequentially but also \emph{actively}. Our test statistic is a likelihood ratio in which one likelihood is found by maximization over all class priors and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling these ``high-dependency'' features increases the power of the proposed testing framework. In theory, we prove that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it in several experiments; the experiments on synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test increases significantly while the Type I error remains under control.
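The likelihood-ratio statistic described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `sequential_lr_statistic` and `anytime_p_values` are hypothetical, and the sketch assumes binary labels and that each classifier probability was computed before the corresponding label was revealed. It covers only the statistic, not the active query rule. The numerator is the label likelihood maximized over the class prior, the denominator is the classifier's predictive likelihood, and the running minimum of the ratio (capped at 1) serves as a $p$-value that can be checked after every query.

```python
import math


def sequential_lr_statistic(labels, predicted_probs):
    """Running likelihood-ratio statistic W_n for n = 1, 2, ...

    labels: 0/1 group memberships in the order they were queried.
    predicted_probs: predicted_probs[i] is the classifier's probability that
        labels[i] == 1, assumed to be computed BEFORE labels[i] was revealed.
    """
    log_alt = 0.0  # log-likelihood of the labels under the classifier
    ones = 0       # number of labels equal to 1 seen so far
    stats = []
    for n, (y, q) in enumerate(zip(labels, predicted_probs), start=1):
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        log_alt += math.log(q if y == 1 else 1.0 - q)
        ones += y
        zeros = n - ones
        # Null likelihood maximized over the class prior p (attained at ones / n).
        log_null = 0.0
        if ones:
            log_null += ones * math.log(ones / n)
        if zeros:
            log_null += zeros * math.log(zeros / n)
        stats.append(math.exp(log_null - log_alt))
    return stats


def anytime_p_values(stats):
    """Running minimum of min(1, W_n), one value per query."""
    p_values, current = [], 1.0
    for w in stats:
        current = min(current, w)
        p_values.append(current)
    return p_values


# A classifier that predicts the labels well drives the statistic toward 0.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
probs = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1]
p_values = anytime_p_values(sequential_lr_statistic(labels, probs))
print(p_values[-1] <= 0.05)
```

In this sketch the test would stop and reject the null the first time the running value falls to or below the chosen significance level. An uninformative classifier (for example, one that always predicts 0.5) keeps the statistic at or above 1, so the null is never rejected.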

Cite

@article{arxiv.2301.12616,
  title  = {Active Sequential Two-Sample Testing},
  author = {Weizhi Li and Prad Kadambi and Pouria Saidi and Karthikeyan Natesan Ramamurthy and Gautam Dasarathy and Visar Berisha},
  journal= {arXiv preprint arXiv:2301.12616},
  year   = {2024}
}