Related papers: Sample Efficient Model Evaluation

Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories

For machine learning models trained with limited labeled training data, validation stands to become the main bottleneck to reducing overall annotation costs. We propose a statistical validation algorithm that accurately estimates the…

Computer Vision and Pattern Recognition · Computer Science 2021-09-14 Fait Poms, Vishnu Sarukkai, Ravi Teja Mullapudi, Nimit S. Sohoni, William R. Mark, Deva Ramanan, Kayvon Fatahalian

Impact of Strategic Sampling and Supervision Policies on Semi-supervised Learning

In semi-supervised representation learning frameworks, when the number of labelled data is very scarce, the quality and representativeness of these samples become increasingly important. Existing literature on semi-supervised learning…

Computer Vision and Pattern Recognition · Computer Science 2024-11-05 Shuvendu Roy, Ali Etemad

Sampling with replacement vs Poisson sampling: a comparative study in optimal subsampling

Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency,…

Statistics Theory · Mathematics 2022-05-19 Jing Wang, Jiahui Zou, HaiYing Wang

Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data

We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently…

Computer Vision and Pattern Recognition · Computer Science 2025-04-25 Weiran Pan, Wei Wei, Feida Zhu, Yong Deng

Sample Selection with Uncertainty of Losses for Learning with Noisy Labels

In learning with noisy labels, the sample selection approach is very popular, which regards small-loss data as correctly labeled during training. However, losses are generated on-the-fly based on the model being trained with noisy labels,…

Machine Learning · Computer Science 2021-06-02 Xiaobo Xia, Tongliang Liu, Bo Han, Mingming Gong, Jun Yu, Gang Niu, Masashi Sugiyama

Combining Self-labeling with Selective Sampling

Since data is the fuel that drives machine learning models, and access to labeled data is generally expensive, semi-supervised methods are constantly popular. They enable the acquisition of large datasets without the need for too many…

Machine Learning · Computer Science 2023-01-12 Jędrzej Kozal, Michał Woźniak

On missing label patterns in semi-supervised learning

We investigate model based classification with partially labelled training data. In many biostatistical applications, labels are manually assigned by experts, who may leave some observations unlabelled due to class uncertainty. We analyse…

Methodology · Statistics 2019-04-08 Daniel Ahfock, Geoffrey J. McLachlan

Classifier Risk Estimation under Limited Labeling Resources

In this paper we propose strategies for estimating performance of a classifier when labels cannot be obtained for the whole test set. The number of test instances which can be labeled is very small compared to the whole test data size. The…

Machine Learning · Computer Science 2018-02-21 Anurag Kumar, Bhiksha Raj

Significance Analysis of High-Dimensional, Low-Sample Size Partially Labeled Data

Classification and clustering are both important topics in statistical learning. A natural question herein is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be…

Machine Learning · Statistics 2015-09-22 Qiyi Lu, Xingye Qiao

Evaluating multiple models using labeled and unlabeled data

It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce…

Machine Learning · Computer Science 2025-10-15 Divya Shanmugam, Shuvom Sadhuka, Manish Raghavan, John Guttag, Bonnie Berger, Emma Pierson

A robust approach to model-based classification based on trimming and constraints

In a standard classification framework a set of trustworthy learning data are employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Therefore, unreliable labelled observations,…

Applications · Statistics 2019-11-20 Andrea Cappozzo, Francesca Greselin, Thomas Brendan Murphy

Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled…

Machine Learning · Computer Science 2021-03-05 Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Ré

Label-Efficient Monitoring of Classification Models via Stratified Importance Sampling

Monitoring the performance of classification models in production is critical yet challenging due to strict labeling budgets, one-shot batch acquisition of labels and extremely low error rates. We propose a general framework based on…

Machine Learning · Computer Science 2026-02-02 Lupo Marsigli, Angel Lopez de Haro

All models are wrong, some are useful: Model Selection with Limited Labels

We introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples for labeling, in order to…

Machine Learning · Computer Science 2024-10-28 Patrik Okanovic, Andreas Kirsch, Jannes Kasper, Torsten Hoefler, Andreas Krause, Nezihe Merve Gürel

Near Optimal Stratified Sampling

The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can…

Machine Learning · Computer Science 2019-07-29 Tiancheng Yu, Xiyu Zhai, Suvrit Sra

Machine Learning from Explanations

Acquiring and training on large-scale labeled data can be impractical due to cost constraints. Additionally, the use of small training datasets can result in considerable variability in model outcomes, overfitting, and learning of spurious…

Machine Learning · Computer Science 2025-07-08 Jiashu Tao, Reza Shokri

Learning the Structure of Generative Models without Labeled Data

Curating labeled training data has become the primary bottleneck in machine learning. Recent frameworks address this bottleneck with generative models to synthesize labels at scale from weak supervision sources. The generative model's…

Machine Learning · Computer Science 2017-09-12 Stephen H. Bach, Bryan He, Alexander Ratner, Christopher Ré

Late Stopping: Avoiding Confidently Learning from Mislabeled Examples

Sample selection is a prevalent method in learning with noisy labels, where small-loss data are typically considered as correctly labeled data. However, this method may not effectively identify clean hard examples with large losses, which…

Machine Learning · Computer Science 2023-08-29 Suqin Yuan, Lei Feng, Tongliang Liu

Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and meet resource constraints on bandwidth or battery power. Estimators that are applied to the…

Databases · Computer Science 2015-03-19 Edith Cohen, Haim Kaplan

Informative missingness and its implications in semi-supervised learning

Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It leverages information from labelled samples, whose acquisition is often costly or labour-intensive, together with unlabelled data to enhance…

Machine Learning · Statistics 2025-12-29 Jinran Wu, You-Gan Wang, Geoffrey J. McLachlan