Related papers: Quantifying Inherent Randomness in Machine Learnin…

On The Impact of Machine Learning Randomness on Group Fairness

Statistical measures for group fairness in machine learning reflect the gap in performance of algorithms across different groups. These measures, however, exhibit a high variance between different training instances, which makes them…

Machine Learning · Computer Science 2023-07-11 Prakhar Ganesh, Hongyan Chang, Martin Strobel, Reza Shokri

Variance of ML-based software fault predictors: are we really improving fault prediction?

Software quality assurance activities become increasingly difficult as software systems become more and more complex and continuously grow in size. Moreover, testing becomes even more expensive when dealing with large-scale systems. Thus,…

Software Engineering · Computer Science 2023-10-27 Xhulja Shahini, Domenic Bubel, Andreas Metzger

Accounting for Variance in Machine Learning Benchmarks

Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter…

Machine Learning · Computer Science 2021-03-05 Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent

Behavior of Hyper-Parameters for Selected Machine Learning Algorithms: An Empirical Investigation

Hyper-parameters (HPs) are an important part of machine learning (ML) model development and can greatly influence performance. This paper studies their behavior for three algorithms: Extreme Gradient Boosting (XGB), Random Forest (RF), and…

Machine Learning · Computer Science 2022-11-17 Anwesha Bhattacharyya, Joel Vaughan, Vijayan N. Nair

The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks

This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques - including several less common transformations - across 14 different Machine Learning algorithms and…

Machine Learning · Computer Science 2025-11-24 João Manoel Herrera Pinheiro, Suzana Vilas Boas de Oliveira, Thiago Henrique Segreto Silva, Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Ricardo V. Godoy, Leonardo André Ambrosio, Marcelo Becker

Heterogeneous Random Forest

Random forest (RF) stands out as a highly favored machine learning approach for classification problems. The effectiveness of RF hinges on two key factors: the accuracy of individual trees and the diversity among them. In this study, we…

Machine Learning · Computer Science 2024-10-28 Ye-eun Kim, Seoung Yun Kim, Hyunjoong Kim

A Mathematical Foundation for Robust Machine Learning based on Bias-Variance Trade-off

A common assumption in machine learning is that samples are independently and identically distributed (i.i.d). However, the contributions of different samples are not identical in training. Some samples are difficult to learn and some…

Machine Learning · Computer Science 2021-11-23 Ou Wu, Weiyao Zhu, Yingjun Deng, Haixiang Zhang, Qinghu Hou

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction…

Computation and Language · Computer Science 2025-07-15 Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky

A Domain-Region Based Evaluation of ML Performance Robustness to Covariate Shift

Most machine learning methods assume that the input data distribution is the same in the training and testing phases. However, in practice, this stationarity is usually not met and the distribution of inputs differs, leading to unexpected…

Machine Learning · Computer Science 2023-04-19 Firas Bayram, Bestoun S. Ahmed

Towards Inferential Reproducibility of Machine Learning Research

Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism which can be regarded as measurement noise. Current…

Machine Learning · Computer Science 2023-10-10 Michael Hagmann, Philipp Meier, Stefan Riezler

Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering

Machine learning (ML) has been widely used in the literature to automate software engineering tasks. However, ML outcomes may be sensitive to randomization in data sampling mechanisms and learning procedures. To understand whether and how…

Software Engineering · Computer Science 2020-12-16 Cynthia C. S. Liem, Annibale Panichella

Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study

This paper compares the performances of three supervised machine learning algorithms in terms of predictive ability and model interpretation on structured or tabular data. The algorithms considered were scikit-learn implementations of…

Machine Learning · Statistics 2022-05-06 Alice J. Liu, Arpita Mukherjee, Linwei Hu, Jie Chen, Vijayan N. Nair

Understanding Random Forests: From Theory to Practice

Data analysis and machine learning have become an integrative part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and…

Machine Learning · Statistics 2015-06-04 Gilles Louppe

Probabilistic Random Forest: A machine learning algorithm for noisy datasets

Machine learning (ML) algorithms become increasingly important in the analysis of astronomical data. However, since most ML algorithms are not designed to take data uncertainties into account, ML based studies are mostly restricted to data…

Instrumentation and Methods for Astrophysics · Physics 2018-12-26 Itamar Reis, Dalya Baron, Sahar Shahaf

Differentiable Random Partition Models

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and…

Machine Learning · Computer Science 2023-11-10 Thomas M. Sutter, Alain Ryser, Joram Liebeskind, Julia E. Vogt

Beyond Random Split for Assessing Statistical Model Performance

Even though a train/test split of the dataset randomly performed is a common practice, could not always be the best approach for estimating performance generalization under some scenarios. The fact is that the usual machine learning…

Machine Learning · Computer Science 2022-09-09 Carlos Catania, Jorge Guerra, Juan Manuel Romero, Gabriel Caffaratti, Martin Marchetta

Finding Influential Training Samples for Gradient Boosted Decision Trees

We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying…

Machine Learning · Computer Science 2018-03-14 Boris Sharchilev, Yury Ustinovsky, Pavel Serdyukov, Maarten de Rijke

Nondeterminism and Instability in Neural Network Optimization

Nondeterminism in neural network optimization produces uncertainty in performance, making small improvements difficult to discern from run-to-run variability. While uncertainty can be reduced by training multiple model copies, doing so is…

Machine Learning · Computer Science 2021-07-13 Cecilia Summers, Michael J. Dinneen

A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions.…

Machine Learning · Statistics 2021-11-15 Ben Adlam, Jake Levinson, Jeffrey Pennington

The Effectiveness of Supervised Machine Learning Algorithms in Predicting Software Refactoring

Refactoring is the process of changing the internal structure of software to improve its quality without modifying its external behavior. Empirical studies have repeatedly shown that refactoring has a positive impact on the…

Software Engineering · Computer Science 2020-09-14 Maurício Aniche, Erick Maziero, Rafael Durelli, Vinicius Durelli