English
Related papers

Related papers: Local case-control sampling: Efficient subsampling…

200 papers

A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…

Computation · Statistics 2018-09-14 Lei Han , Kean Ming Tan , Ting Yang , Tong Zhang

Case-control sampling is a commonly used retrospective sampling design to alleviate imbalanced structure of binary data. When fitting the logistic regression model with case-control data, although the slope parameter of the model can be…

Methodology · Statistics 2024-06-03 Hengchao Shi , Xinyi Liu , Ming Zheng , Wen Yu

Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new…

Machine Learning · Computer Science 2025-07-11 Karen Medlin , Sven Leyffer , Krishnan Raghavan

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data…

Disordered Systems and Neural Networks · Physics 2025-02-03 Emanuele Loffredo , Mauro Pastore , Simona Cocco , Rémi Monasson

Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before…

Machine Learning · Computer Science 2021-05-18 Bin Liu , Grigorios Tsoumakas

Many real-world classification problems are significantly class-imbalanced to detriment of the class of interest. The standard set of proper evaluation metrics is well-known but the usual assumption is that the test dataset imbalance equals…

Machine Learning · Computer Science 2020-04-16 Jan Brabec , Tomáš Komárek , Vojtěch Franc , Lukáš Machlica

Class imbalanced datasets are common in real-world applications that range from credit card fraud detection to rare disease diagnostics. Several popular classification algorithms assume that classes are approximately balanced, and hence…

Machine Learning · Statistics 2018-09-10 Val Andrei Fajardo , David Findlay , Roshanak Houmanfar , Charu Jaiswal , Jiaxi Liang , Honglei Xie

Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions that are difficult for machine learning models to predict accurately. This issue particularly affects neural networks, which…

Machine Learning · Computer Science 2025-04-22 Shayan Alahyari , Mike Domaratzki

Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource…

Methodology · Statistics 2025-03-19 Lin Wang

For massive data, the family of subsampling algorithms is popular to downsize the data volume and reduce computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where…

Computation · Statistics 2019-06-27 HaiYing Wang , Rong Zhu , Ping Ma

Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for…

Methodology · Statistics 2021-05-07 Xinwei Shen , Kani Chen , Wen Yu

In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head) classes have sufficient samples, the minority (the tail) classes can be…

Machine Learning · Computer Science 2021-09-14 Chongsheng Zhang , Paolo Soda , Jingjun Bi , Gaojuan Fan , George Almpanidis , Salvador Garcia

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed…

Artificial Intelligence · Computer Science 2011-11-25 N. V. Chawla , K. W. Bowyer , L. O. Hall , W. P. Kegelmeyer

Semi-supervised learning has received increasingly attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are…

Machine Learning · Statistics 2024-02-26 Zhuojun Quan , Yuanyuan Lin , Kani Chen , Wen Yu

In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the practitioners on the model's performance. A…

Machine Learning · Computer Science 2020-07-21 Ramiro Camino , Christian Hammerschmidt , Radu State

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the…

Machine Learning · Statistics 2020-06-02 HaiYing Wang

A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample…

Methodology · Statistics 2015-11-24 Rong Zhu , Ping Ma , Michael W. Mahoney , Bin Yu

Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We…

Machine Learning · Statistics 2025-05-20 Yan Chen , Jose Blanchet , Krzysztof Dembczynski , Laura Fee Nern , Aaron Flores

Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (\eg Female) within class labels $Y$ (\eg Programmers). This dataset bias can lead to models that learn spurious correlations between class…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Maan Qraitem , Kate Saenko , Bryan A. Plummer

In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation,…

Machine Learning · Computer Science 2022-01-21 Mohamed S. Kraiem , Fernando Sánchez-Hernández , María N. Moreno-García
‹ Prev 1 2 3 10 Next ›