Related papers: Distributed Statistical Inference for Massive Data
The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of…
As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…
In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without over-resampling,…
In this paper, we propose a new statistical inference method for massive data sets, which is very simple and efficient by combining divide-and-conquer method and empirical likelihood. Compared with two popular methods (the bag of little…
In this paper, we address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers. The high volume and dimensionality of the data require distributed…
We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a…
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big…
This article introduces an iterative distributed computing estimator for the multinomial logistic regression model with large choice sets. Compared to the maximum likelihood estimator, the proposed iterative distributed estimator achieves…
In multicenter research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed.…
We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites and sharing…
It is not unusual for a data analyst to encounter data sets distributed across several computers. This can happen for reasons such as privacy concerns, efficiency of likelihood evaluations, or just the sheer size of the whole data set. This…
This paper presents a class of new algorithms for distributed statistical estimation that exploit divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important…
We propose a bootstrap procedure for data that may exhibit clustering in two or more dimensions. We use insights from the theory of generalized U-statistics to analyze the large-sample properties of statistics that are sample averages from…
In multicenter biomedical research, integrating data from multiple decentralized sites provides more robust and generalizable findings due to its larger sample size and the ability to account for the between-site heterogeneity. However,…
We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., N >> p, in the generalized linear models framework. When such datasets are too big to be analyzed…
The proliferation of science and technology has led to the prevalence of voluminous data sets that are distributed across multiple machines. It is an established fact that conventional statistical methodologies may be unfeasible in the…
This paper considers distributed M-estimation under heterogeneous distributions among distributed data blocks. A weighted distributed estimator is proposed to improve the efficiency of the standard "Split-And-Conquer" (SaC) estimator for…
When data are stored across multiple locations, directly pooling all the data together for statistical analysis may be impossible due to communication costs and privacy concerns. Distributed computing systems allow the analysis of such…
In distributed, or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories, and asked to combine them into a single model that gives efficient statistical estimation. A…
Estimating statistical models within sensor networks requires distributed algorithms, in which both data and computation are distributed across the nodes of the network. We propose a general approach for distributed learning based on…