Related papers: NesPrInDT: Nested undersampling in PrInDT
In this paper, we extend our PrInDT method (Weihs & Buschfeld 2021a) towards undersampling with different percentages of the smaller and the larger classes (psmall and plarge), stratification of predictors, varying the prediction threshold,…
In many real-world binary classification tasks (e.g. detection of certain objects from images), the available dataset is imbalanced, i.e., it has far fewer representatives of one class (the minor class) than of the other. Generally, accurate…
The class imbalance problem is commonly faced when developing machine learning models for real-life problems. Because of it, the fitted model tends to be biased towards the majority class data, which leads to lower precision, recall, AUC,…
Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We…
This study is about inducing classifiers using data that is imbalanced, with a minority class being under-represented in relation to the majority classes. The first section of this research focuses on the main characteristics of data that…
A learning classifier must outperform a trivial solution; in the case of imbalanced data, this condition usually does not hold. To overcome this problem, we propose a novel data-level resampling method, Clustering Based Oversampling for…
Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new…
A number of classification problems need to deal with data imbalance between classes. Often it is desired to have a high recall on the minority class while maintaining a high precision on the majority class. In this paper, we review a…
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation,…
In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead practitioners about the model's performance. A…
For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets. Many approaches to solving this challenge have been offered in the literature. Oversampling, on the other hand, is a…
Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource…
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data…
In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…
Learning from an imbalanced dataset is a tricky proposition. Because these datasets are biased towards one class, most existing classifiers tend not to perform well on minority class examples. Conventional classifiers usually aim to…
Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used in natural language processing (NLP) to generate new…
In this paper, we show that conditional inference trees and ensembles are suitable methods for modeling linguistic variation. In contrast to earlier linguistic applications, however, we claim that their suitability is strongly increased if we…
Data imbalance is common in production data, where controlled production settings require data to fall within a narrow range of variation and data are collected with quality assessment in mind, rather than data analytic insights. This…
Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its…
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work…