Related papers: Finding High-Value Training Data Subset through Di…

Convex Dataset Valuation for Post-Training

Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all…

Machine Learning · Computer Science 2026-05-19 Siqi Zeng , Christopher Jung , Rui Li , Zhe Kang , Ming Li , Nima Noorshams , Zhigang Wang , Fuchun Peng , Han Zhao , Xue Feng

Adaptive Data Selection for Multi-Layer Perceptron Training: A Sub-linear Value-Driven Method

Data selection is one of the fundamental problems in neural network training, particularly for multi-layer perceptrons (MLPs) where identifying the most valuable training samples from massive, multi-source, and heterogeneous data sources…

Machine Learning · Computer Science 2025-10-27 Xiyang Zhang , Chen Liang , Haoxuan Qiu , Hongzhi Wang

DCNNs on a Diet: Sampling Strategies for Reducing the Training Set Size

Large-scale supervised classification algorithms, especially those based on deep convolutional neural networks (DCNNs), require vast amounts of training data to achieve state-of-the-art performance. Decreasing this data requirement would…

Computer Vision and Pattern Recognition · Computer Science 2016-06-15 Maya Kabkab , Azadeh Alavi , Rama Chellappa

Convex Online Video Frame Subset Selection using Multiple Criteria for Data Efficient Autonomous Driving

Training vision-based Urban Autonomous driving models is a challenging problem, which is highly researched in recent times. Training such models is a data-intensive task requiring the storage and processing of vast volumes of (possibly…

Machine Learning · Computer Science 2021-03-25 Soumi Das , Harikrishna Patibandla , Suparna Bhattacharya , Kshounis Bera , Niloy Ganguly , Sourangshu Bhattacharya

Deep Learning with Differential Privacy

Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive…

Machine Learning · Statistics 2018-12-21 Martín Abadi , Andy Chu , Ian Goodfellow , H. Brendan McMahan , Ilya Mironov , Kunal Talwar , Li Zhang

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom , Axel Feldmann , Aleksander Madry

CheckSel: Efficient and Accurate Data-valuation Through Online Checkpoint Selection

Data valuation and subset selection have emerged as valuable tools for application-specific selection of important training data. However, the efficiency-accuracy tradeoffs of state-of-the-art methods hinder their widespread application to…

Machine Learning · Computer Science 2022-03-15 Soumi Das , Manasvi Sagarkar , Suparna Bhattacharya , Sourangshu Bhattacharya

On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is…

Machine Learning · Computer Science 2025-04-04 Maximilian Böther , Abraham Sebastian , Pranjal Awasthi , Ana Klimovic , Srikumar Ramalingam

Exploring Data Redundancy in Real-world Image Classification through Data Selection

Deep learning models often require large amounts of data for training, leading to increased costs. It is particularly challenging in medical imaging, i.e., gathering distributed data for centralized training, and meanwhile, obtaining…

Computer Vision and Pattern Recognition · Computer Science 2023-06-27 Zhenyu Tang , Shaoting Zhang , Xiaosong Wang

Sub-Setting Algorithm for Training Data Selection in Pattern Recognition

Modern pattern recognition tasks use complex algorithms that take advantage of large datasets to make more accurate predictions than traditional algorithms such as decision trees or k-nearest-neighbor better suited to describe simple…

Machine Learning · Statistics 2021-10-14 AGaurav Arwade , Sigurdur Olafsson

Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection

The goal of Feature Selection - comprising filter, wrapper, and embedded approaches - is to find the optimal feature subset for designated downstream tasks. Nevertheless, current feature selection methods are limited by: 1) the selection…

Machine Learning · Computer Science 2023-09-18 Meng Xiao , Dongjie Wang , Min Wu , Pengfei Wang , Yuanchun Zhou , Yanjie Fu

Dataset Condensation with Gradient Matching

As the state-of-the-art machine learning methods in many fields rely on larger datasets, storing datasets and training models on them become significantly more expensive. This paper proposes a training set synthesis technique for…

Computer Vision and Pattern Recognition · Computer Science 2021-03-09 Bo Zhao , Konda Reddy Mopuri , Hakan Bilen

Same accuracy, twice as fast: continuous training surpasses retraining from scratch

Continual learning aims to enable models to adapt to new datasets without losing performance on previously learned data, often assuming that prior data is no longer available. However, in many practical scenarios, both old and new data are…

Machine Learning · Computer Science 2025-03-03 Eli Verwimp , Guy Hacohen , Tinne Tuytelaars

Efficient Parameter Sampling for Neural Network Construction

The customizable nature of deep learning models have allowed them to be successful predictors in various disciplines. These models are often trained with respect to thousands or millions of instances for complicated problems, but the…

Machine Learning · Computer Science 2019-12-24 Drimik Roy Chowdhury , Muhammad Firmansyah Kasim

Data Dropout: Optimizing Training Data for Convolutional Neural Networks

Deep learning models learn to fit training data while they are highly expected to generalize well to testing data. Most works aim at finding such models by creatively designing architectures and fine-tuning parameters. To adapt to…

Computer Vision and Pattern Recognition · Computer Science 2018-09-10 Tianyang Wang , Jun Huan , Bo Li

Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms

Sampling biases in training data are a major source of algorithmic biases in machine learning systems. Although there are many methods that attempt to mitigate such algorithmic biases during training, the most direct and obvious way is…

Machine Learning · Statistics 2022-04-15 Laura Niss , Yuekai Sun , Ambuj Tewari

A Review of Meta-level Learning in the Context of Multi-component, Multi-level Evolving Prediction Systems

The exponential growth of volume, variety and velocity of data is raising the need for investigations of automated or semi-automated ways to extract useful patterns from the data. It requires deep expert knowledge and extensive…

Machine Learning · Computer Science 2020-07-22 Abbas Raza Ali , Marcin Budka , Bogdan Gabrys

Optimizing Data Usage via Differentiable Rewards

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model…

Machine Learning · Computer Science 2021-06-18 Xinyi Wang , Hieu Pham , Paul Michel , Antonios Anastasopoulos , Jaime Carbonell , Graham Neubig

Are All Training Examples Created Equal? An Empirical Study

Modern computer vision algorithms often rely on very large training datasets. However, it is conceivable that a carefully selected subsample of the dataset is sufficient for training. In this paper, we propose a gradient-based importance…

Machine Learning · Computer Science 2018-12-03 Kailas Vodrahalli , Ke Li , Jitendra Malik

Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision

Supervised machine learning based state-of-the-art computer vision techniques are in general data hungry. Their data curation poses the challenges of expensive human labeling, inadequate computing resources and larger experiment turn around…

Computer Vision and Pattern Recognition · Computer Science 2019-01-07 Vishal Kaushal , Rishabh Iyer , Suraj Kothawade , Rohan Mahadev , Khoshrav Doctor , Ganesh Ramakrishnan