Related papers: Model-specific Data Subsampling with Influence Fun…

Optimal Sub-sampling with Influence Functions

Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the…

Machine Learning · Statistics 2017-09-07 Daniel Ting , Eric Brochu

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom , Axel Feldmann , Aleksander Madry

Less Is Better: Unweighted Data Subsampling via Influence Function

In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…

Machine Learning · Computer Science 2021-04-14 Zifeng Wang , Hong Zhu , Zhenhua Dong , Xiuqiang He , Shao-Lun Huang

Revisiting Data Attribution for Influence Functions

The goal of data attribution is to trace the model's predictions through the learning algorithm and back to its training data. thereby identifying the most influential training samples and understanding how the model's behavior leads to…

Machine Learning · Computer Science 2025-08-12 Hongbo Zhu , Angelo Cangelosi

Investigating the Impact of Data Selection Strategies on Language Model Performance

Data selection is critical for enhancing the performance of language models, particularly when aligning training datasets with a desired target distribution. This study explores the effects of different data selection methods and feature…

Computation and Language · Computer Science 2025-01-08 Jiayao Gu , Liting Chen , Yihong Li

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In today's modern era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as…

Methodology · Statistics 2022-09-07 Amalan Mahendran , Helen Thompson , James M. McGree

Accelerating Machine Learning Algorithms with Adaptive Sampling

The era of huge data necessitates highly efficient machine learning algorithms. Many common machine learning algorithms, however, rely on computationally intensive subroutines that are prohibitively expensive on large datasets. Oftentimes,…

Machine Learning · Computer Science 2023-09-26 Mo Tiwari

Cross-Loss Influence Functions to Explain Deep Network Representations

As machine learning is increasingly deployed in the real world, it is paramount that we develop the tools necessary to analyze the decision-making of the models we train and deploy to end-users. Recently, researchers have shown that…

Machine Learning · Computer Science 2022-05-05 Andrew Silva , Rohit Chopra , Matthew Gombolay

How to Achieve Higher Accuracy with Less Training Points?

In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that…

Machine Learning · Computer Science 2025-04-21 Jinghan Yang , Anupam Pani , Yunchao Zhang

Optimal subdata selection for linear model selection

If the assumed model does not accurately capture the underlying structure of the data, a statistical method is likely to yield sub-optimal results, and so model selection is crucial in order to conduct any statistical analysis. However, in…

Methodology · Statistics 2023-06-21 Vasilis Chasiotis , Dimitris Karlis

Model Selection Techniques -- An Overview

In the era of big data, analysts usually explore various statistical models or machine learning methods for observed data in order to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are…

Machine Learning · Statistics 2018-10-24 Jie Ding , Vahid Tarokh , Yuhong Yang

Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples

Data subsampling has become widely recognized as a tool to overcome computational and economic bottlenecks in analyzing massive datasets. We contribute to the development of adaptive design for estimation of finite population…

Methodology · Statistics 2024-07-08 Henrik Imberg , Xiaomi Yang , Carol Flannagan , Jonas Bärgman

AutoSampling: Search for Effective Data Sampling Schedules

Data sampling acts as a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn due to the inherently high dimension of parameters in learning the sampling schedule. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2021-05-31 Ming Sun , Haoxuan Dou , Baopu Li , Lei Cui , Junjie Yan , Wanli Ouyang

Prediction-Oriented Subsampling from Data Streams

Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data…

Machine Learning · Computer Science 2025-12-23 Benedetta Lavinia Mussati , Freddie Bickford Smith , Tom Rainforth , Stephen Roberts

Meta-Learning for Resampling Recommendation Systems

One possible approach to tackle the class imbalance in classification tasks is to resample a training dataset, i.e., to drop some of its elements or to synthesize new ones. There exist several widely-used resampling methods. Recent research…

Machine Learning · Computer Science 2018-09-18 Smolyakov Dmitry , Alexander Korotin , Pavel Erofeev , Artem Papanov , Evgeny Burnaev

Training Data Influence Analysis and Estimation: A Survey

Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's…

Machine Learning · Computer Science 2024-04-02 Zayd Hammoudeh , Daniel Lowd

Sharing pattern submodels for prediction with missing values

Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels have been proposed as…

Machine Learning · Computer Science 2023-11-27 Lena Stempfle , Ashkan Panahi , Fredrik D. Johansson

Learning Sample-Specific Models with Low-Rank Personalized Regression

Modern applications of machine learning (ML) deal with increasingly heterogeneous datasets comprised of data collected from overlapping latent subpopulations. As a result, traditional models trained over large datasets may fail to recognize…

Machine Learning · Statistics 2019-10-16 Benjamin Lengerich , Bryon Aragam , Eric P. Xing

Understanding Data Influence with Differential Approximation

Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However,…

Machine Learning · Computer Science 2025-08-21 Haoru Tan , Sitong Wu , Xiuzhe Wu , Wang Wang , Bo Zhao , Zeke Xie , Gui-Song Xia , Xiaojuan Qi

Demystifying statistical learning based on efficient influence functions

Evaluation of treatment effects and more general estimands is typically achieved via parametric modelling, which is unsatisfactory since model misspecification is likely. Data-adaptive model building (e.g. statistical/machine learning) is…

Statistics Theory · Mathematics 2022-01-14 Oliver Hines , Oliver Dukes , Karla Diaz-Ordaz , Stijn Vansteelandt