Related papers: Scalable subsampling: computation, aggregation and…
Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…
Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis…
In real-world applications, observations are often constrained to a small fraction of a system. Such spatial subsampling can be caused by the inaccessibility or the sheer size of the system, and cannot be overcome by longer sampling.…
The bootstrap is a widely used procedure for statistical inference because of its simplicity and attractive statistical properties. However, the vanilla version of bootstrap is no longer feasible computationally for many modern massive…
This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each…
Computational capability often falls short when confronted with massive data, posing a common challenge in establishing a statistical model or statistical inference method dealing with big data. While subsampling techniques have been…
In today's modern era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as…
Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods are developed for more effective coefficient estimation and model…
Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the…
Bootstrapping is often applied to get confidence limits for semiparametric inference of a target parameter in the presence of nuisance parameters. Bootstrapping with replacement can be computationally expensive and problematic when…
Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…
This article introduces a subbagging (subsample aggregating) approach for variable selection in regression within the context of big data. The proposed subbagging approach not only ensures that variable selection is scalable given the…
Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach…
Several new methods have been proposed for performing valid inference after model selection. An older method is sampling splitting: use part of the data for model selection and part for inference. In this paper we revisit sample splitting…
Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…
The rapid development of computing power and efficient Markov Chain Monte Carlo (MCMC) simulation algorithms have revolutionized Bayesian statistics, making it a highly practical inference method in applied work. However, MCMC algorithms…
Subsampling is an efficient method to deal with massive data. In this paper, we investigate the optimal subsampling for linear quantile regression when the covariates are functions. The asymptotic distribution of the subsampling estimator…
The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets which are increasingly prevalent, the bootstrap becomes prohibitively costly in computation and its…
Subsampling is an effective approach to alleviate the computational burden associated with large-scale datasets. Nevertheless, existing subsampling estimators incur a substantial loss in estimation efficiency compared to estimators based on…
We study the effectiveness of subagging, or subsample aggregating, on regression trees, a popular non-parametric method in machine learning. First, we give sufficient conditions for pointwise consistency of trees. We formalize that (i) the…