Scalable subsampling: computation, aggregation and inference

Dimitris N. Politis

Statistics Theory, 2021-12-14, v1

Authors: Dimitris N. Politis

Abstract

Subsampling is a general statistical method, developed in the 1990s, for estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference, such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era, where the standard bootstrap with full-size resamples can be infeasible to compute. Nevertheless, even choosing a single random subsample of size $b$ can be computationally challenging when both $b$ and the sample size $n$ are very large. In this paper, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective -- and computationally feasible -- distribution estimation via subsampling. Further, we show how the same set of subsamples yields a procedure for subsampling aggregation -- also known as subagging -- that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as $\hat \theta _n$. The paper concludes by showing how to conduct inference, e.g., construct confidence intervals, based on the scalable subagging estimator instead of the original $\hat \theta _n$.
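To make the idea concrete, the following is a minimal sketch of non-random subsampling and subagging, not the paper's exact procedure. It assumes the subsamples are consecutive, non-overlapping blocks of size $b$, that the statistic converges at rate $\sqrt{b}$, and that the subsample statistics are centered at the subagging estimate. These choices, along with the hypothetical function names `scalable_subsamples`, `subagging`, and `subsampling_quantiles`, are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def scalable_subsamples(x, b):
    """Split x into q = n // b consecutive, non-overlapping blocks of size b.

    Assumption: block-based subsamples stand in for the "appropriately
    chosen, non-random subsamples"; any trailing remainder is dropped.
    """
    x = np.asarray(x)
    q = len(x) // b
    if b < 1 or q < 1:
        raise ValueError("b must be between 1 and the sample size")
    return x[: q * b].reshape(q, b)


def subagging(x, b, statistic=np.mean):
    """Average the statistic over the q blocks (subsampling aggregation)."""
    blocks = scalable_subsamples(x, b)
    theta_b = np.array([statistic(block) for block in blocks])
    return theta_b.mean(), theta_b


def subsampling_quantiles(theta_b, center, b, probs=(0.025, 0.975), rate=np.sqrt):
    """Empirical quantiles of rate(b) * (theta_b - center).

    Assumption: a sqrt(b) convergence rate unless another rate is supplied.
    """
    roots = rate(b) * (np.asarray(theta_b) - center)
    return np.quantile(roots, probs)


# Usage on simulated data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
theta_bar, theta_b = subagging(x, b=1_000)
lo, hi = subsampling_quantiles(theta_b, theta_bar, b=1_000)
```

Each block is read once and no random index generation is needed, which is what keeps the computation feasible when both $b$ and $n$ are large. The quantiles `lo` and `hi` describe the estimated distribution of the rescaled subsample statistic; turning them into a confidence interval requires the rate and centering appropriate to the estimator at hand.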

Keywords

statistical inference, nonparametric regression, covariance estimation

Cite

@article{arxiv.2112.06434,
  title  = {Scalable subsampling: computation, aggregation and inference},
  author = {Dimitris N. Politis},
  journal= {arXiv preprint arXiv:2112.06434},
  year   = {2021}
}
