Related papers: SPlit: An Optimal Method for Data Splitting

Data Twinning

In this work, we develop a method named Twinning, for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and…

Machine Learning · Statistics 2022-02-17 Akhil Vakayil, V. Roshan Joseph

Optimal Ratio for Data Splitting

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show…

Machine Learning · Statistics 2022-06-10 V. Roshan Joseph

Does Data Splitting Improve Prediction?

Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation but we use this part for estimating the parameters of the chosen model. We focus on the…

Methodology · Statistics 2016-01-20 Julian J. Faraway

EvoSplit: An evolutionary approach to split a multi-label data set into disjoint subsets

This paper presents a new evolutionary approach, EvoSplit, for the distribution of multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers either divide a data set randomly or using…

Machine Learning · Computer Science 2021-03-24 Francisco Florez-Revuelta

We propose a Similarity-Based Stratified Splitting (SBSS) technique, which uses both the output and input space information to split the data. The splits are generated using similarity functions among samples to place similar samples in…

Machine Learning · Computer Science 2020-10-14 Felipe Farias, Teresa Ludermir, Carmelo Bastos-Filho

Constructing Decision Trees from Data Streams

In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations \(x_i\) and their corresponding labels \(y_i\), without the i.i.d. assumption, the…

Data Structures and Algorithms · Computer Science 2025-04-18 Huy Pham, Hoang Ta, Hoa T. Vu

Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators

As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate…

Econometrics · Economics 2025-11-27 Bruno Fava

A Random Sample Partition Data Model for Big Data Analysis

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Salman Salloum, Yulin He, Joshua Zhexue Huang, Xiaoliang Zhang, Tamer Z. Emara, Chenghao Wei, Heping He

Optimal subsampling algorithm for the marginal model with large longitudinal data

Big data is ubiquitous in practice, and it has also led to a heavy computational burden. To reduce the computational cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the…

Methodology · Statistics 2023-11-16 Haohui Han, Liya Fu

Split Conformal Classification with Unsupervised Calibration

Methods for split conformal prediction leverage calibration samples to transform any prediction rule into a set-prediction rule that complies with a target coverage probability. Existing methods provide remarkably strong performance…

Machine Learning · Statistics 2025-10-15 Santiago Mazuelas

Improving optimal subsampling through stratification

Recent works have proposed optimal subsampling algorithms to improve computational efficiency in large datasets and to design validation studies in the presence of measurement error. Existing approaches generally fall into two categories:…

Methodology · Statistics 2025-12-25 Jasper B. Yang, Thomas Lumley, Bryan E. Shepherd, Pamela A. Shaw

Beyond Random Split for Assessing Statistical Model Performance

Even though a randomly performed train/test split of the dataset is common practice, it may not always be the best approach for estimating generalization performance in some scenarios. The fact is that the usual machine learning…

Machine Learning · Computer Science 2022-09-09 Carlos Catania, Jorge Guerra, Juan Manuel Romero, Gabriel Caffaratti, Martin Marchetta

Optimal Data Splitting in Distributed Optimization for Machine Learning

The distributed optimization problem has become increasingly relevant recently. It has a lot of advantages such as processing a large amount of data in less time compared to non-distributed methods. However, most distributed approaches…

Optimization and Control · Mathematics 2024-03-27 Daniil Medyakov, Gleb Molodtsov, Aleksandr Beznosikov, Alexander Gasnikov

Learning to Split for Automatic Bias Detection

Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (ls), an algorithm for automatic bias detection. Given a dataset with input-label pairs, ls learns to split this dataset so that predictors…

Machine Learning · Computer Science 2022-07-22 Yujia Bao, Regina Barzilay

Adaptive Split Learning over Energy-Constrained Wireless Edge Networks

Split learning (SL) is a promising approach for training artificial intelligence (AI) models, in which devices collaborate with a server to train an AI model in a distributed manner, based on the same fixed split point. However, due to the…

Machine Learning · Computer Science 2025-03-14 Zuguang Li, Wen Wu, Shaohua Wu, Wei Wang

Splitting method for spatio-temporal search efforts planning

This article deals with the spatio-temporal sensors deployment in order to maximize detection probability of an intelligent and randomly moving target in an area under surveillance. Our work is based on the rare events simulation framework.…

Neural and Evolutionary Computing · Computer Science 2017-02-24 Chouchane Mathieu, Paris Sébastien, Le Gland François, Ouladsine Mustapha

Optimizing Split Points for Error-Resilient SplitFed Learning

Recent advancements in decentralized learning, such as Federated Learning (FL), Split Learning (SL), and Split Federated Learning (SplitFed), have expanded the potential of machine learning. SplitFed aims to minimize the computational…

Artificial Intelligence · Computer Science 2024-05-31 Chamani Shiranthika, Parvaneh Saeedi, Ivan V. Bajić

Efficient and Private Approximations of Distributed Databases Calculations

In recent years, an increasing amount of data has been collected in different, and often non-cooperative, databases. The problem of privacy-preserving, distributed calculations over separated databases and the related issue of private data…

Databases · Computer Science 2016-05-23 Philip Derbeko, Shlomi Dolev, Ehud Gudes, Jeffrey D. Ullman

Data splitting improves statistical performance in overparametrized regimes

While large training datasets generally offer improvement in model performance, the training process becomes computationally expensive and time consuming. Distributed learning is a common strategy to reduce the overall training time by…

Machine Learning · Statistics 2021-10-22 Nicole Mücke, Enrico Reiss, Jonas Rungenhagen, Markus Klein

Split Regression Modeling

Sparse methods are the standard approach to obtain interpretable models with high prediction accuracy. Alternatively, algorithmic ensemble methods can achieve higher prediction accuracy at the cost of loss of interpretability. However, the…

Methodology · Statistics 2022-01-11 Anthony Christidis, Stefan Van Aelst, Ruben Zamar