Related papers: Multiple Imputation Through XGBoost

XGBoost: A Scalable Tree Boosting System

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results…

Machine Learning · Computer Science 2016-06-14 Tianqi Chen, Carlos Guestrin

MIBoost: A gradient boosting algorithm for variable selection after multiple imputation

Statistical learning methods for automated variable selection, such as the Least Absolute Shrinkage and Selection Operator (LASSO), elastic nets, and gradient boosting, have become increasingly popular tools for building powerful prediction…

Machine Learning · Statistics 2026-04-13 Robert Kuchen

Adapting tree-based multiple imputation methods for multi-level data? A simulation study

When data have a hierarchical structure, such as students nested within classrooms, ignoring dependencies between observations can compromise the validity of imputation procedures. Standard tree-based imputation methods implicitly assume…

Applications · Statistics 2025-03-21 Nico Föge, Jakob Schwerter, Ketevan Gurtskaia, Markus Pauly, Philipp Doebler

A Comparative Analysis of XGBoost

XGBoost is a scalable ensemble technique based on gradient boosting that has demonstrated to be a reliable and efficient machine learning challenge solver. This work proposes a practical analysis of how this novel technique works in terms…

Machine Learning · Computer Science 2023-05-05 Candice Bentéjac, Anna Csörgő, Gonzalo Martínez-Muñoz

Tree Boosting Methods for Balanced and Imbalanced Classification and their Robustness Over Time in Risk Assessment

Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult…

Machine Learning · Computer Science 2025-04-28 Gissel Velarde, Michael Weichert, Anuj Deshmunkh, Sanjay Deshmane, Anindya Sudhir, Khushboo Sharma, Vaibhav Joshi

A Simple and Fast Baseline for Tuning Large XGBoost Models

XGBoost, a scalable tree boosting algorithm, has proven effective for many prediction tasks of practical interest, especially using tabular datasets. Hyperparameter tuning can further improve the predictive performance, but unlike neural…

Machine Learning · Computer Science 2021-11-16 Sanyam Kapoor, Valerio Perrone

XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

Tree ensembles such as XGBoost are often preferred for discriminative tasks in mixed-type tabular data, due to their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged…

Machine Learning · Computer Science 2026-03-10 Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit

XGBoost: Scalable GPU Accelerated Learning

We describe the multi-GPU gradient boosting algorithm implemented in the XGBoost library (https://github.com/dmlc/xgboost). Our algorithm allows fast, scalable training on multi-GPU systems with all of the features of the XGBoost library.…

Machine Learning · Computer Science 2018-07-02 Rory Mitchell, Andrey Adinets, Thejaswi Rao, Eibe Frank

Generalized XGBoost Method

The XGBoost method has many advantages and is especially suitable for statistical analysis of big data, but its loss function is limited to convex functions. In many specific applications, a nonconvex loss function would be preferable. In…

Machine Learning · Computer Science 2022-01-20 Yang Guang

Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems

Missing data are present in most real world problems and need careful handling to preserve the prediction accuracy and statistical consistency in the downstream analysis. As the gold standard of handling missing data, multiple imputation…

Machine Learning · Computer Science 2021-12-23 Zongyu Dai, Zhiqi Bu, Qi Long

Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data

Missing data are ubiquitous in real world applications and, if not adequately handled, may lead to the loss of information and biased findings in downstream analysis. Particularly, high-dimensional incomplete data with a moderate sample…

Machine Learning · Computer Science 2022-12-23 Zongyu Dai, Zhiqi Bu, Qi Long

Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics

Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study…

Methodology · Statistics 2021-08-18 Yajuan Si, Steve Heeringa, David Johnson, Roderick Little, Wenshuo Liu, Fabian Pfeffer, Trivellore Raghunathan

Meta-Imputation Balanced (MIB): An Ensemble Approach for Handling Missing Data in Biomedical Machine Learning

Missing data represents a fundamental challenge in machine learning applications, often reducing model performance and reliability. This problem is particularly acute in fields like bioinformatics and clinical machine learning, where…

Machine Learning · Computer Science 2025-09-04 Fatemeh Azad, Zoran Bosnić, Matjaž Kukar

Multiple imputation using dimension reduction techniques for high-dimensional data

Missing data present challenges in data analysis. Naive analyses such as complete-case and available-case analysis may introduce bias and loss of efficiency, and produce unreliable results. Multiple imputation (MI) is one of the most widely…

Methodology · Statistics 2019-05-15 Domonique W. Hodge, Sandra E. Safo, Qi Long

Multi-Target XGBoostLSS Regression

Current implementations of Gradient Boosting Machines are mostly designed for single-target regression tasks and commonly assume independence between responses when used in multivariate settings. As such, these models are not well suited if…

Machine Learning · Computer Science 2022-10-14 Alexander März

Multivariate Boosted Trees and Applications to Forecasting and Control

Gradient boosted trees are competition-winning, general-purpose, non-parametric regressors, which exploit sequential model fitting and gradient descent to minimize a specific loss function. The most popular implementations are tailored to…

Machine Learning · Computer Science 2022-08-23 Lorenzo Nespoli, Vasco Medici

A Comparison of Modeling Preprocessing Techniques

This paper compares the performance of various data processing methods in terms of predictive performance for structured data. This paper also seeks to identify and recommend preprocessing methodologies for tree-based binary classification…

Methodology · Statistics 2023-02-27 Tosan Johnson, Alice J. Liu, Syed Raza, Aaron McGuire

Adaptive XGBoost for Evolving Data Streams

Boosting is an ensemble method that combines base models in a sequential manner to achieve high predictive accuracy. A popular learning algorithm based on this ensemble method is eXtreme Gradient Boosting (XGB). We present an adaptation of…

Machine Learning · Computer Science 2020-05-18 Jacob Montiel, Rory Mitchell, Eibe Frank, Bernhard Pfahringer, Talel Abdessalem, Albert Bifet

Generalized Optimal Classification Trees: A Mixed-Integer Programming Approach

Global optimization of decision trees is a long-standing challenge in combinatorial optimization, yet such models play an important role in interpretable machine learning. Although the problem has been investigated for several decades, only…

Machine Learning · Computer Science 2026-02-03 Jiancheng Tu, Wenqi Fan, Zhibin Wu

Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly-true models

Multiple imputation provides us with efficient estimators in model-based methods for handling missing data under the true model. It is also well-understood that design-based estimators are robust methods that do not require accurately…

Methodology · Statistics 2020-06-11 Kyunghee Han, Pamela A. Shaw, Thomas Lumley