Related papers: Subbagging Variable Selection for Big Data

On the Subbagging Estimation for Massive Data

This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each…

Methodology · Statistics 2021-03-05 Tao Zou, Xian Li, Xuan Liang, Hansheng Wang

Scalable subsampling: computation, aggregation and inference

Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals…

Statistics Theory · Mathematics 2021-12-14 Dimitris N. Politis

Maximum-Variance-Reduction Stratification for Improved Subsampling

Subsampling is a widely used and effective approach for addressing the computational challenges posed by massive datasets. Substantial progress has been made in developing non-uniform, probability-based subsampling schemes that prioritize…

Methodology · Statistics 2026-05-07 Dingyi Wang, Haiying Wang, Qingpei Hu

Optimal Subsampling Approaches for Large Sample Linear Regression

A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample…

Methodology · Statistics 2015-11-24 Rong Zhu, Ping Ma, Michael W. Mahoney, Bin Yu

A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran, Helen Thompson, James M. McGree

A Characterization of Mean Squared Error for Estimator with Bagging

Bagging can significantly improve the generalization performance of unstable machine learning algorithms such as trees or neural networks. Though bagging is now widely used in practice and many empirical studies have explored its behavior,…

Machine Learning · Computer Science 2019-08-08 Martin Mihelich, Charles Dognin, Yan Shu, Michael Blot

A replica analysis of under-bagging

Under-bagging (UB), which combines under-sampling and bagging, is a popular ensemble learning method for training classifiers on imbalanced data. Using bagging to reduce the increased variance caused by the reduction in sample size due…

Machine Learning · Statistics 2025-05-19 Takashi Takahashi

Efficient subsampling for high-dimensional data

In the field of big data analytics, the search for efficient subdata selection methods that enable robust statistical inferences with minimal computational resources is of high importance. A procedure prior to subdata selection could…

Methodology · Statistics 2024-11-12 Vasilis Chasiotis, Lin Wang, Dimitris Karlis

Reducing Sampling Ratios Improves Bagging in Sparse Regression

Bagging, a powerful ensemble method from machine learning, improves the performance of unstable predictors. Although the power of Bagging has been shown mostly in classification problems, we demonstrate the success of employing Bagging in…

Machine Learning · Statistics 2019-05-03 Luoluo Liu, Sang Peter Chin, Trac D. Tran

The Loss Rank Criterion for Variable Selection in Linear Regression Analysis

Lasso and other regularization procedures are attractive methods for variable selection, subject to a proper choice of shrinkage parameter. Given a set of potential subsets produced by a regularization algorithm, a consistent model…

Methodology · Statistics 2014-02-26 Minh-Ngoc Tran

Towards a statistical theory of data selection under weak supervision

Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the…

Machine Learning · Statistics 2023-10-05 Germain Kolossov, Andrea Montanari, Pulkit Tandon

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis, Dimitris Karlis

Bagging in overparameterized learning: Risk characterization and risk monotonization

Bagging is a commonly used ensemble technique in statistics and machine learning to improve the performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged predictors under the proportional…

Statistics Theory · Mathematics 2023-10-26 Pratik Patil, Jin-Hong Du, Arun Kumar Kuchibhotla

Subset Selection for Multiple Linear Regression via Optimization

Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that trade off fitting error (explanatory power) and model complexity (number of variables selected). We build mathematical programming…

Machine Learning · Statistics 2020-09-04 Young Woong Park, Diego Klabjan

Subsampling for Big Data Linear Models with Measurement Errors

Subsampling algorithms for various parametric regression models with massive data have been extensively investigated in recent years. However, all existing studies on subsampling heavily rely on clean massive data. In practical…

Statistics Theory · Mathematics 2025-06-11 Jiangshan Ju, Mingqiu Wang, Shengli Zhao

Precise Asymptotics of Bagging Regularized M-estimators

We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators and construct a consistent estimator for the risk. Specifically, we consider a…

Statistics Theory · Mathematics 2025-09-30 Takuya Koriyama, Pratik Patil, Jin-Hong Du, Kai Tan, Pierre C. Bellec

Orthogonal Subsampling for Big Data Linear Regression

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal…

Methodology · Statistics 2021-06-01 Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu

Randomized maximum-contrast selection: subagging for large-scale regression

We introduce a very general method for sparse and large-scale variable selection. The large-scale regression setting is such that both the number of parameters and the number of samples are extremely large. The proposed method is based on…

Statistics Theory · Mathematics 2019-07-31 Jelena Bradic

Support vector machines with a reject option

This paper studies $\ell_1$ regularization with high-dimensional features for support vector machines with a built-in reject option (meaning that the decision of classifying an observation can be withheld at a cost lower than that of…

Statistics Theory · Mathematics 2012-01-06 Marten Wegkamp, Ming Yuan

On the asymptotic properties of a bagging estimator with a massive dataset

Bagging is a useful method for large-scale statistical analysis, especially when the computing resources are very limited. We study here the asymptotic properties of bagging estimators for $M$-estimation problems but with massive datasets.…

Statistics Theory · Mathematics 2023-04-14 Yuan Gao, Riquan Zhang, Hansheng Wang