Related papers: Does Data Splitting Improve Prediction?

SPlit: An Optimal Method for Data Splitting

In this article we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of Support Points (SP), which was initially developed for finding the optimal…

Machine Learning · Statistics 2021-05-10 V. Roshan Joseph, Akhil Vakayil

Exploring Data Splitting Strategies for the Evaluation of Recommendation Models

Effective methodologies for evaluating recommender systems are critical, so that such systems can be compared in a sound manner. A commonly overlooked aspect of recommender system evaluation is the selection of the data splitting strategy.…

Information Retrieval · Computer Science 2020-07-28 Zaiqiao Meng, Richard McCreadie, Craig Macdonald, Iadh Ounis

Sample Splitting as an M-Estimator with Application to Physical Activity Scoring

Sample splitting is widely used in statistical applications, including classically in classification and more recently for inference post model selection. Motivated by problems in the study of diet, physical activity, and health, we…

Methodology · Statistics 2019-08-13 Eli S. Kravitz, Raymond J. Carroll, David Ruppert

Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators

As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate…

Econometrics · Economics 2025-11-27 Bruno Fava

Beyond Random Split for Assessing Statistical Model Performance

Even though a randomly performed train/test split of the dataset is common practice, it may not always be the best approach for estimating generalization performance in some scenarios. The fact is that the usual machine learning…

Machine Learning · Computer Science 2022-09-09 Carlos Catania, Jorge Guerra, Juan Manuel Romero, Gabriel Caffaratti, Martin Marchetta

Data splitting improves statistical performance in overparametrized regimes

While large training datasets generally offer improvement in model performance, the training process becomes computationally expensive and time consuming. Distributed learning is a common strategy to reduce the overall training time by…

Machine Learning · Statistics 2021-10-22 Nicole Mücke, Enrico Reiss, Jonas Rungenhagen, Markus Klein

Estimation from Partially Sampled Distributed Traces

Sampling is often a necessary evil to reduce the processing and storage costs of distributed tracing. In this work, we describe a scalable and adaptive sampling approach that can preserve events of interest better than the widely used…

Data Structures and Algorithms · Computer Science 2021-07-19 Otmar Ertl

Evaluating A/B Testing Methodologies via Sample Splitting: Theory and Practice

We develop a theoretical framework for sample splitting in A/B testing environments, where data for each test are partitioned into two splits to measure methodological performance when the true impacts of tests are unobserved. We show that…

Econometrics · Economics 2026-03-24 Ryan Kessler, James McQueen, Miikka Rokkanen

Privacy and Efficiency of Communications in Federated Split Learning

Every day, large amounts of sensitive data are distributed across mobile phones, wearable devices, and other sensors. Traditionally, these enormous datasets have been processed on a single system, with complex models being trained to make…

Machine Learning · Computer Science 2023-01-10 Zongshun Zhang, Andrea Pinto, Valeria Turina, Flavio Esposito, Ibrahim Matta

Optimal Ratio for Data Splitting

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show…

Machine Learning · Statistics 2022-06-10 V. Roshan Joseph

To Split or Not to Split: The Impact of Disparate Treatment in Classification

Disparate treatment occurs when a machine learning model yields different decisions for individuals based on a sensitive attribute (e.g., age, sex). In domains where prediction accuracy is paramount, it could potentially be acceptable to…

Machine Learning · Computer Science 2022-04-15 Hao Wang, Hsiang Hsu, Mario Diaz, Flavio P. Calmon

Data Twinning

In this work, we develop a method named Twinning, for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and…

Machine Learning · Statistics 2022-02-17 Akhil Vakayil, V. Roshan Joseph

Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders

Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction…

Information Retrieval · Computer Science 2025-08-11 Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov

Split Regression Modeling

Sparse methods are the standard approach to obtain interpretable models with high prediction accuracy. Alternatively, algorithmic ensemble methods can achieve higher prediction accuracy at the cost of loss of interpretability. However, the…

Methodology · Statistics 2022-01-11 Anthony Christidis, Stefan Van Aelst, Ruben Zamar

A note on data splitting with e-values: online appendix to my comment on Glenn Shafer's "Testing by betting"

This note reanalyzes Cox's idealized example of testing with data splitting using e-values (Shafer's betting scores). Cox's exciting finding was that the method of data splitting, while allowing flexible data analysis, achieves quite high…

Methodology · Statistics 2020-08-27 Vladimir Vovk

Optimal Data Split Methodology for Model Validation

The decision to incorporate cross-validation into validation processes of mathematical models raises an immediate question - how should one partition the data into calibration and validation sets? We answer this question systematically: we…

Data Analysis, Statistics and Probability · Physics 2011-08-31 Rebecca Morrison, Corey Bryant, Gabriel Terejanu, Kenji Miki, Serge Prudhomme

Automated Data Slicing for Model Validation: A Big data - AI Integration Approach

As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all…

Databases · Computer Science 2019-01-08 Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, Steven Euijong Whang

The Data Addition Dilemma

In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model…

Machine Learning · Computer Science 2024-08-09 Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen

Data fission: splitting a single data point

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is…

Methodology · Statistics 2023-12-12 James Leiner, Boyan Duan, Larry Wasserman, Aaditya Ramdas

A Stochastic Large-scale Machine Learning Algorithm for Distributed Features and Observations

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…

Machine Learning · Statistics 2019-12-10 Biyi Fang, Diego Klabjan