Related papers: Optimal Downsampling for Imbalanced Classification…
A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample…
Subsampling algorithms for various parametric regression models with massive data have been extensively investigated in recent years. However, all existing studies on subsampling heavily rely on clean massive data. In practical…
In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the practitioners on the model's performance. A…
Big data is ubiquitous in practices, and it has also led to heavy computation burden. To reduce the calculation cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the…
Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the…
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data…
For massive data, the family of subsampling algorithms is popular to downsize the data volume and reduce computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where…
Imbalanced Learning is an important learning algorithm for the classification models, which have enjoyed much popularity on many applications. Typically, imbalanced learning algorithms can be partitioned into two types, i.e., data level…
To fast approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the…
Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would…
Subsampling is an efficient method to deal with massive data. In this paper, we investigate the optimal subsampling for linear quantile regression when the covariates are functions. The asymptotic distribution of the subsampling estimator…
A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…
Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new…
Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…
While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ balanced dataset often achieves close to state-of-the-art-accuracy across several popular…
Despite extensive research spanning several decades, class imbalance is still considered a profound difficulty for both machine learning and deep learning models. While data oversampling is the foremost technique to address this issue,…
Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods are developed for more effective coefficient estimation and model…
Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting…
Class imbalance problem is commonly faced while developing machine learning models for real-life issues. Due to this problem, the fitted model tends to be biased towards the majority class data, which leads to lower precision, recall, AUC,…