Related papers: Distributed Statistical Inference for Massive Data

200 papers

The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of…

Computation · Statistics 2023-04-14 Yuan Gao, Weidong Liu, Hansheng Wang, Xiaozhou Wang, Yibo Yan, Riquan Zhang

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…

Machine Learning · Statistics 2019-12-10 Biyi Fang, Diego Klabjan

In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. This new method is computationally efficient in that we bootstrap on the master machine without over-resampling,…

Machine Learning · Statistics 2020-02-21 Yang Yu, Shih-Kang Chao, Guang Cheng

In this paper, we propose a new statistical inference method for massive data sets, which is very simple and efficient by combining divide-and-conquer method and empirical likelihood. Compared with two popular methods (the bag of little…

Methodology · Statistics 2020-04-21 Xuejun Ma, Shaochen Wang, Wang Zhou

In this paper, we address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers. The high volume and dimensionality of the data require distributed…

Machine Learning · Statistics 2022-11-30 Emadaldin Mozafari-Majd, Visa Koivunen

We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a…

Methodology · Statistics 2022-06-15 Yang Yu, Shih-Kang Chao, Guang Cheng

The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big…

Statistics Theory · Mathematics 2018-04-12 Stanislav Volgushev, Shih-Kang Chao, Guang Cheng

This article introduces an iterative distributed computing estimator for the multinomial logistic regression model with large choice sets. Compared to the maximum likelihood estimator, the proposed iterative distributed estimator achieves…

Econometrics · Economics 2024-12-03 Yanqin Fan, Yigit Okar, Xuetao Shi

In multicenter research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed.…

Methodology · Statistics 2021-03-25 Rui Duan, Yang Ning, Yong Chen

We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites and sharing…

Methodology · Statistics 2025-08-26 Jingyue Huang, Huiyuan Wang, Yuqing Lei, Yong Chen

It is not unusual for a data analyst to encounter data sets distributed across several computers. This can happen for reasons such as privacy concerns, efficiency of likelihood evaluations, or just the sheer size of the whole data set. This…

Computation · Statistics 2018-05-22 Randy C. S. Lai, J. Hannig, Thomas C. M. Lee

This paper presents a class of new algorithms for distributed statistical estimation that exploit divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important…

Statistics Theory · Mathematics 2018-08-29 Stanislav Minsker, Nate Strawn

We propose a bootstrap procedure for data that may exhibit clustering in two or more dimensions. We use insights from the theory of generalized U-statistics to analyze the large-sample properties of statistics that are sample averages from…

Methodology · Statistics 2017-12-06 Konrad Menzel

In multicenter biomedical research, integrating data from multiple decentralized sites provides more robust and generalizable findings due to its larger sample size and the ability to account for the between-site heterogeneity. However,…

Methodology · Statistics 2025-12-29 Xiaokang Liu, Yuchen Yang, Yifei Sun, Jiang Bian, Yanyuan Ma, Raymond J. Carroll, Yong Chen

We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., N >> p, in the generalized linear models framework. When such datasets are too big to be analyzed…

Methodology · Statistics 2020-07-23 Lu Tang, Ling Zhou, Peter X.-K. Song

The proliferation of science and technology has led to the prevalence of voluminous data sets that are distributed across multiple machines. It is an established fact that conventional statistical methodologies may be unfeasible in the…

Statistics Theory · Mathematics 2023-10-24 Lu Yan, Jiang Hu

This paper considers distributed M-estimation under heterogeneous distributions among distributed data blocks. A weighted distributed estimator is proposed to improve the efficiency of the standard "Split-And-Conquer" (SaC) estimator for…

Statistics Theory · Mathematics 2022-09-15 Jia Gu, Songxi Chen

When data are stored across multiple locations, directly pooling all the data together for statistical analysis may be impossible due to communication costs and privacy concerns. Distributed computing systems allow the analysis of such…

Methodology · Statistics 2025-02-27 Xian Li, Xuan Liang, A. H. Welsh, Tao Zou

In distributed, or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories, and asked to combine them into a single model that gives efficient statistical estimation. A…

Machine Learning · Statistics 2017-03-01 Jun Han, Qiang Liu

Estimating statistical models within sensor networks requires distributed algorithms, in which both data and computation are distributed across the nodes of the network. We propose a general approach for distributed learning based on…

Machine Learning · Computer Science 2012-07-03 Qiang Liu, Alexander Ihler