Related papers: Sample Size Planning for Classification Models

Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?

Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model…

Applications · Statistics 2024-07-25 Luis H. John, Jan A. Kors, Jenna M. Reps, Patrick B. Ryan, Peter R. Rijnbeek

Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we…

Computation and Language · Computer Science 2026-01-26 Branislav Pecher, Ivan Srba, Maria Bielikova

Exploring the Impact of Dataset Statistical Effect Size on Model Performance and Data Sample Size Sufficiency

Having a sufficient quantity of quality data is a critical enabler of training effective machine learning models. Being able to effectively determine the adequacy of a dataset prior to training and evaluating a model's performance would be…

Machine Learning · Computer Science 2026-04-28 Arya Hatamian, Lionel Levine, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource…

Methodology · Statistics 2024-09-11 Yunhui Qi, Xinyi Wang, Li-Xuan Qin

Strategies and impact of learning curve estimation for CNN-based image classification

Learning curves are a measure for how the performance of machine learning models improves given a certain volume of training data. Over a wide variety of applications and models it was observed that learning curves follow -- to a large…

Machine Learning · Computer Science 2023-10-13 Laura Didyk, Brayden Yarish, Michael A. Beck, Christopher P. Bidinosti, Christopher J. Henry

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with…

Artificial Intelligence · Computer Science 2011-06-24 F. Provost, G. M. Weiss

Sample size estimation for power and accuracy in the experimental comparison of algorithms

Experimental comparisons of performance represent an important aspect of research on optimization algorithms. In this work we present a methodology for defining the required sample sizes for designing experiments with desired statistical…

Neural and Evolutionary Computing · Computer Science 2018-10-16 Felipe Campelo, Fernanda Takahashi

Classifier Risk Estimation under Limited Labeling Resources

In this paper we propose strategies for estimating performance of a classifier when labels cannot be obtained for the whole test set. The number of test instances which can be labeled is very small compared to the whole test data size. The…

Machine Learning · Computer Science 2018-02-21 Anurag Kumar, Bhiksha Raj

How much data do you need? Part 2: Predicting DL class specific training dataset sizes

This paper targets the question of predicting machine learning classification model performance, when taking into account the number of training examples per class and not just the overall number of training examples. This leads to the a…

Machine Learning · Computer Science 2024-03-12 Thomas Mühlenstädt, Jelena Frtunikj

Sample size determination for training cancer classifiers from microarray and RNA-seq data

The objective of many high-dimensional microarray and RNA-seq studies is to develop a classifier of cancer patients based on characteristics of their disease. The germinal center B-cell (GCB) classifier study in lymphoma and the National…

Applications · Statistics 2015-09-17 Sandra Safo, Xiao Song, Kevin K. Dobbin

Semisupervised Classifier Evaluation and Recalibration

How many labeled examples are needed to estimate a classifier's performance on a new dataset? We study the case where data is plentiful, but labels are expensive. We show that by making a few reasonable assumptions on the structure of the…

Machine Learning · Computer Science 2012-10-09 Peter Welinder, Max Welling, Pietro Perona

Free Lunch for Few-shot Learning: Distribution Calibration

Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. In this paper, we calibrate the distribution of these…

Machine Learning · Computer Science 2021-08-17 Shuo Yang, Lu Liu, Min Xu

How many images do I need? Understanding how sample size per class affects deep learning model performance metrics for balanced designs in autonomous wildlife monitoring

Deep learning (DL) algorithms are the state of the art in automated classification of wildlife camera trap images. The challenge is that the ecologist cannot know in advance how many images per species they need to collect for model…

Computer Vision and Pattern Recognition · Computer Science 2020-10-19 Saleh Shahinfar, Paul Meek, Greg Falzon

An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification

Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Some special strategies need to be adopted, either by modifying the data distribution or by…

Machine Learning · Computer Science 2022-08-26 Asif Newaz, Shahriar Hassan, Farhan Shahriyar Haq

Better Classifier Calibration for Small Data Sets

Classifier calibration does not always go hand in hand with the classifier's ability to separate the classes. There are applications where good classifier calibration, i.e. the ability to produce accurate probability estimates, is more…

Machine Learning · Computer Science 2020-05-26 Tuomo Alasalmi, Jaakko Suutala, Heli Koskimäki, Juha Röning

Recommending Training Set Sizes for Classification

Based on a comprehensive study of 20 established data sets, we recommend training set sizes for any classification data set. We obtain our recommendations by systematically withholding training data and developing models through five…

Machine Learning · Computer Science 2021-02-19 Phillip Koshute, Jared Zook, Ian McCulloh

Class Mean Vectors, Self Monitoring and Self Learning for Neural Classifiers

In this paper we explore the role of sample mean in building a neural network for classification. This role is surprisingly extensive and includes: direct computation of weights without training, performance monitoring for samples without…

Machine Learning · Computer Science 2019-10-23 Eugene Wong

A decomposition of Fisher's information to inform sample size for developing fair and precise clinical prediction models -- part 1: binary outcomes

When developing a clinical prediction model, the sample size of the development dataset is a key consideration. Small sample sizes lead to greater concerns of overfitting, instability, poor performance and lack of fairness. Previous…

Methodology · Statistics 2025-01-27 Richard D Riley, Gary S Collins, Rebecca Whittle, Lucinda Archer, Kym IE Snell, Paula Dhiman, Laura Kirton, Amardeep Legha, Xiaoxuan Liu, Alastair Denniston, Frank E Harrell, Laure Wynants, Glen P Martin, Joie Ensor

Feature Selection from High-Dimensional Data with Very Low Sample Size: A Cautionary Tale

In classification problems, the purpose of feature selection is to identify a small, highly discriminative subset of the original feature set. In many applications, the dataset may have thousands of features and only a few dozens of samples…

Machine Learning · Computer Science 2020-08-28 Ludmila I. Kuncheva, Clare E. Matthews, Álvar Arnaiz-González, Juan J. Rodríguez

Impact of Strategic Sampling and Supervision Policies on Semi-supervised Learning

In semi-supervised representation learning frameworks, when the number of labelled data is very scarce, the quality and representativeness of these samples become increasingly important. Existing literature on semi-supervised learning…

Computer Vision and Pattern Recognition · Computer Science 2024-11-05 Shuvendu Roy, Ali Etemad