Related papers: Local case-control sampling: Efficient subsampling…

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…

Computation · Statistics 2018-09-14 Lei Han , Kean Ming Tan , Ting Yang , Tong Zhang

Statistical inference for case-control logistic regression via integrating external summary data

Case-control sampling is a commonly used retrospective sampling design to alleviate imbalanced structure of binary data. When fitting the logistic regression model with case-control data, although the slope parameter of the model can be…

Methodology · Statistics 2024-06-03 Hengchao Shi , Xinyi Liu , Ming Zheng , Wen Yu

A Bilevel Optimization Framework for Imbalanced Data Classification

Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new…

Machine Learning · Computer Science 2025-07-11 Karen Medlin , Sven Leyffer , Krishnan Raghavan

Restoring balance: principled under/oversampling of data for optimal classification

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data…

Disordered Systems and Neural Networks · Physics 2025-02-03 Emanuele Loffredo , Mauro Pastore , Simona Cocco , Rémi Monasson

Synthetic Oversampling of Multi-Label Data based on Local Label Distribution

Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before…

Machine Learning · Computer Science 2021-05-18 Bin Liu , Grigorios Tsoumakas

On Model Evaluation under Non-constant Class Imbalance

Many real-world classification problems are significantly class-imbalanced to detriment of the class of interest. The standard set of proper evaluation metrics is well-known but the usual assumption is that the test dataset imbalance equals…

Machine Learning · Computer Science 2020-04-16 Jan Brabec , Tomáš Komárek , Vojtěch Franc , Lukáš Machlica

VOS: a Method for Variational Oversampling of Imbalanced Data

Class imbalanced datasets are common in real-world applications that range from credit card fraud detection to rare disease diagnostics. Several popular classification algorithms assume that classes are approximately balanced, and hence…

Machine Learning · Statistics 2018-09-10 Val Andrei Fajardo , David Findlay , Roshanak Houmanfar , Charu Jaiswal , Jiaxi Liang , Honglei Xie

Local distribution-based adaptive oversampling for imbalanced regression

Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions that are difficult for machine learning models to predict accurately. This issue particularly affects neural networks, which…

Machine Learning · Computer Science 2025-04-22 Shayan Alahyari , Mike Domaratzki

Balanced Subsampling for Big Data with Categorical Covariates

Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource…

Methodology · Statistics 2025-03-19 Lin Wang

Optimal Subsampling for Large Sample Logistic Regression

For massive data, the family of subsampling algorithms is popular to downsize the data volume and reduce computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where…

Computation · Statistics 2019-06-27 HaiYing Wang , Rong Zhu , Ping Ma

Surprise sampling: improving and extending the local case-control sampling

Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for…

Methodology · Statistics 2021-05-07 Xinwei Shen , Kani Chen , Wen Yu

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head) classes have sufficient samples, the minority (the tail) classes can be…

Machine Learning · Computer Science 2021-09-14 Chongsheng Zhang , Paolo Soda , Jingjun Bi , Gaojuan Fan , George Almpanidis , Salvador Garcia

SMOTE: Synthetic Minority Over-sampling Technique

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed…

Artificial Intelligence · Computer Science 2011-11-25 N. V. Chawla , K. W. Bowyer , L. O. Hall , W. P. Kegelmeyer

Efficient semi-supervised inference for logistic regression under case-control studies

Semi-supervised learning has received increasingly attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are…

Machine Learning · Statistics 2024-02-26 Zhuojun Quan , Yuanyuan Lin , Kani Chen , Wen Yu

Minority Class Oversampling for Tabular Data with Deep Generative Models

In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the practitioners on the model's performance. A…

Machine Learning · Computer Science 2020-07-21 Ramiro Camino , Christian Hammerschmidt , Radu State

Logistic Regression for Massive Data with Rare Events

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the…

Machine Learning · Statistics 2020-06-02 HaiYing Wang

Optimal Subsampling Approaches for Large Sample Linear Regression

A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample…

Methodology · Statistics 2015-11-24 Rong Zhu , Ping Ma , Michael W. Mahoney , Bin Yu

Optimal Downsampling for Imbalanced Classification with Generalized Linear Models

Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We…

Machine Learning · Statistics 2025-05-20 Yan Chen , Jose Blanchet , Krzysztof Dembczynski , Laura Fee Nern , Aaron Flores

Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (\eg Female) within class labels $Y$ (\eg Programmers). This dataset bias can lead to models that learn spurious correlations between class…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Maan Qraitem , Kate Saenko , Bryan A. Plummer

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation,…

Machine Learning · Computer Science 2022-01-21 Mohamed S. Kraiem , Fernando Sánchez-Hernández , María N. Moreno-García