Related papers: Optimal Ratio for Data Splitting

SPlit: An Optimal Method for Data Splitting

In this article we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of Support Points (SP), which was initially developed for finding the optimal…

Machine Learning · Statistics 2021-05-10 V. Roshan Joseph, Akhil Vakayil

Partial Resampling of Imbalanced Data

Imbalanced data is a frequently encountered problem in machine learning. Despite a vast amount of literature on sampling techniques for imbalanced data, there is a limited number of studies that address the issue of the optimal sampling…

Machine Learning · Computer Science 2022-07-12 Firuz Kamalov, Amir F. Atiya, Dina Elreedy

Does Data Splitting Improve Prediction?

Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation but we use this part for estimating the parameters of the chosen model. We focus on the…

Methodology · Statistics 2016-01-20 Julian J. Faraway

Test Set Sizing for the Ridge Regression

We derive the ideal train/test split for the ridge regression to high accuracy in the limit that the number of training rows m becomes large. The split must depend on the ridge tuning parameter, alpha, but we find that the dependence is…

Machine Learning · Statistics 2025-09-08 Alexander Dubbs

Data splitting improves statistical performance in overparametrized regimes

While large training datasets generally offer improvement in model performance, the training process becomes computationally expensive and time consuming. Distributed learning is a common strategy to reduce the overall training time by…

Machine Learning · Statistics 2021-10-22 Nicole Mücke, Enrico Reiss, Jonas Rungenhagen, Markus Klein

Test Set Sizing Via Random Matrix Theory

This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying…

Machine Learning · Statistics 2022-07-26 Alexander Dubbs

Using the Distribution of Performance for Studying Statistical NLP Systems and Corpora

Statistical NLP systems are frequently evaluated and compared on the basis of their performances on a single split of training and test data. Results obtained using a single split are, however, subject to sampling noise. In this paper we…

Computation and Language · Computer Science 2007-05-23 Yuval Krymolowski

On the Optimality of Averaging in Distributed Statistical Learning

A common approach to statistical learning with big-data is to randomly split it among $m$ machines and learn the parameter of interest by averaging the $m$ individual estimates. In this paper, focusing on empirical risk minimization, or…

Machine Learning · Statistics 2016-06-14 Jonathan Rosenblatt, Boaz Nadler

Learning where to learn: Training data distribution optimization for scientific machine learning

In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution…

Machine Learning · Computer Science 2025-12-09 Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang

Balanced Split: A new train-test data splitting strategy for imbalanced datasets

Classification data sets with skewed class proportions are called imbalanced. Class imbalance is a problem since most machine learning classification algorithms are built with an assumption of equal representation of all classes in the…

Machine Learning · Computer Science 2022-12-22 Azal Ahmad Khan

Predicting Regression Probability Distributions with Imperfect Data Through Optimal Transformations

The goal of regression analysis is to predict the value of a numeric outcome variable y given a vector of joint values of other (predictor) variables x. Usually a particular x-vector does not specify a repeatable value for y, but rather a…

Machine Learning · Statistics 2020-01-29 Jerome H. Friedman

Optimal Data Splitting in Distributed Optimization for Machine Learning

The distributed optimization problem has become increasingly relevant recently. It has a lot of advantages such as processing a large amount of data in less time compared to non-distributed methods. However, most distributed approaches…

Optimization and Control · Mathematics 2024-03-27 Daniil Medyakov, Gleb Molodtsov, Aleksandr Beznosikov, Alexander Gasnikov

Beyond Random Split for Assessing Statistical Model Performance

Even though a randomly performed train/test split of the dataset is common practice, it may not always be the best approach for estimating generalization performance in some scenarios. The fact is that the usual machine learning…

Machine Learning · Computer Science 2022-09-09 Carlos Catania, Jorge Guerra, Juan Manuel Romero, Gabriel Caffaratti, Martin Marchetta

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on…

Computation and Language · Computer Science 2026-03-20 Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier

Mixing Deep Learning and Multiple Criteria Optimization: An Application to Distributed Learning with Multiple Datasets

The training phase is the most important stage during the machine learning process. In the case of labeled data and supervised learning, machine training consists in minimizing the loss function subject to different constraints. In an…

Machine Learning · Computer Science 2021-12-03 Davide La Torre, Danilo Liuzzi, Marco Repetto, Matteo Rocca

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with…

Artificial Intelligence · Computer Science 2011-06-24 F. Provost, G. M. Weiss

Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators

As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate…

Econometrics · Economics 2025-11-27 Bruno Fava

Data Twinning

In this work, we develop a method named Twinning, for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and…

Machine Learning · Statistics 2022-02-17 Akhil Vakayil, V. Roshan Joseph

Recommending Training Set Sizes for Classification

Based on a comprehensive study of 20 established data sets, we recommend training set sizes for any classification data set. We obtain our recommendations by systematically withholding training data and developing models through five…

Machine Learning · Computer Science 2021-02-19 Phillip Koshute, Jared Zook, Ian McCulloh

Constructing Decision Trees from Data Streams

In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations $x_i$ and their corresponding labels $y_i$, without the i.i.d. assumption, the…

Data Structures and Algorithms · Computer Science 2025-04-18 Huy Pham, Hoang Ta, Hoa T. Vu