Related papers: Data Budgeting for Machine Learning

A Data Management Approach for Dataset Selection Using Human Computation

As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly…

Machine Learning · Computer Science 2013-07-16 Alexandros Ntoulas, Omar Alonso, Vasilis Kandylas

The Economics of AI Training Data: A Research Agenda

Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science,…

Computers and Society · Computer Science 2026-04-28 Hamidah Oderinwale, Anna Kazlauskas

Budget-Constrained Tool Learning with Planning

Despite intensive efforts devoted to tool learning, the problem of budget-constrained tool learning, which focuses on resolving user queries within a specific budget constraint, has been widely overlooked. This paper proposes a novel method…

Artificial Intelligence · Computer Science 2024-06-12 Yuanhang Zheng, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

In most practical settings and theoretical analyses, one assumes that a model can be trained until convergence. However, the growing complexity of machine learning datasets and models may violate such assumptions. Indeed, current approaches…

Computer Vision and Pattern Recognition · Computer Science 2020-07-01 Mengtian Li, Ersin Yumer, Deva Ramanan

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical…

Computer Vision and Pattern Recognition · Computer Science 2022-07-14 Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law

Compute-Constrained Data Selection

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin, Alexander M. Rush

Low-Cost Learning via Active Data Procurement

We design mechanisms for online procurement of data held by strategic agents for machine learning tasks. The challenge is to use past data to actively price future data and give learning guarantees even when an agent's cost for revealing…

Computer Science and Game Theory · Computer Science 2015-06-09 Jacob Abernethy, Yiling Chen, Chien-Ju Ho, Bo Waggoner

Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm

High-quality machine learning models are dependent on access to high-quality training data. When the data are not already available, it is tedious and costly to obtain them. Data markets help with identifying valuable training data: model…

Machine Learning · Computer Science 2023-06-06 Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, Mladen Kolar

The Effects of Data Quality on Machine Learning Performance on Tabular Data

Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example,…

Databases · Computer Science 2025-05-15 Sedir Mohammed, Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, Hazar Harmouch

Labels or Preferences? Budget-Constrained Learning with Human Judgments over AI-Generated Outputs

The increasing reliance on human preference feedback to judge AI-generated pseudo labels has created a pressing need for principled, budget-conscious data acquisition strategies. We address the crucial question of how to optimally allocate…

Machine Learning · Statistics 2026-02-13 Zihan Dong, Xiaotian Hou, Ruijia Wu, Linjun Zhang

Learning Aggregation Rules in Participatory Budgeting: A Data-Driven Approach

Participatory Budgeting (PB) offers a democratic process for communities to allocate public funds across various projects through voting. In practice, PB organizers face challenges in selecting aggregation rules either because they are not…

Machine Learning · Computer Science 2024-12-04 Roy Fairstein, Dan Vilenchik, Kobi Gal

Q-Sat AI: Machine Learning-Based Decision Support for Data Saturation in Qualitative Studies

The determination of sample size in qualitative research has traditionally relied on the subjective and often ambiguous principle of data saturation, which can lead to inconsistencies and threaten methodological rigor. This study introduces…

Machine Learning · Computer Science 2025-12-10 Hasan Tutar, Caner Erden, Ümit Şentürk

Fix your Models by Fixing your Datasets

The quality of underlying training data is very crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…

Machine Learning · Computer Science 2021-12-16 Atindriyo Sanyal, Vikram Chatterji, Nidhi Vyas, Ben Epstein, Nikita Demir, Anthony Corletti

Deep Active Learning with Budget Annotation

Digital data collected over the decades and data currently being produced with use of information technology is vastly the unlabeled data or data without description. The unlabeled data is relatively easy to acquire but expensive to label…

Machine Learning · Computer Science 2022-08-02 Kinyua Gikunda

Bandit Data-Driven Optimization

Applications of machine learning in the non-profit and public sectors often feature an iterative workflow of data acquisition, prediction, and optimization of interventions. There are four major pain points that a machine learning pipeline…

Machine Learning · Computer Science 2022-01-19 Zheyuan Ryan Shi, Zhiwei Steven Wu, Rayid Ghani, Fei Fang

Budget Learning via Bracketing

Conventional machine learning applications in the mobile/IoT setting transmit data to a cloud-server for predictions. Due to cost considerations (power, latency, monetary), it is desirable to minimise device-to-server transmissions. The…

Machine Learning · Computer Science 2020-04-15 Aditya Gangrade, Durmus Alp Emre Acar, Venkatesh Saligrama

AI Competitions and Benchmarks: Dataset Development

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even…

Machine Learning · Computer Science 2024-04-16 Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects…

Machine Learning · Computer Science 2025-05-22 Qi Liu, Wanjing Ma

Label Budget Allocation in Multi-Task Learning

The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should…

Machine Learning · Computer Science 2023-08-25 Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry