Related papers: Improving Data Cleaning Using Discrete Optimizatio…

Data Imputation by Pursuing Better Classification: A Supervised Kernel-Based Method

Data imputation, the process of filling in missing feature elements for incomplete data sets, plays a crucial role in data-driven learning. A fundamental belief is that data imputation is helpful for learning performance, and it follows…

Machine Learning · Computer Science 2025-09-30 Ruikai Yang, Fan He, Mingzhen He, Kaijie Wang, Xiaolin Huang

Batchwise Probabilistic Incremental Data Cleaning

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

Learning Over Dirty Data with Minimal Repairs

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML…

Machine Learning · Computer Science 2026-03-19 Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil

DPER: Efficient Parameter Estimation for Randomly Missing Data

The missing data problem has been broadly studied in the last few decades and has various applications in different areas such as statistics or bioinformatics. Even though many methods have been developed to tackle this challenge, most of…

Machine Learning · Statistics 2021-06-10 Thu Nguyen, Khoi Minh Nguyen-Duy, Duy Ho Minh Nguyen, Binh T. Nguyen, Bruce Alan Wade

Evolving imputation strategies for missing data in classification problems with TPOT

Missing data has a ubiquitous presence in real-life applications of machine learning techniques. Imputation methods are algorithms conceived for restoring missing values in the data, based on other entries in the database. The choice of the…

Machine Learning · Computer Science 2017-08-16 Unai Garciarena, Roberto Santana, Alexander Mendiburu

Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects…

Machine Learning · Computer Science 2025-05-22 Qi Liu, Wanjing Ma

Optimized Linear Imputation

Often in real-world datasets, especially in high dimensional data, some feature values are missing. Since most data analysis and statistical methods do not handle missing values gracefully, the first step in the analysis requires the…

Machine Learning · Statistics 2016-12-08 Yehezkel S. Resheff, Daphna Weinshall

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

The More Data, the Better? Demystifying Deletion-Based Methods in Linear Regression with Missing Data

We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses…

Methodology · Statistics 2023-05-02 Tianchen Xu, Kun Chen, Gen Li

Missing Value Estimation Algorithms on Cluster and Representativeness Preservation of Gene Expression Microarray Data

Missing values are largely inevitable in gene expression microarray studies. Data sets often have significant omissions due to individuals dropping out of experiments, errors in data collection, image corruptions, and so on. Missing data…

Quantitative Methods · Quantitative Biology 2018-09-18 Marie Li

An Interdisciplinary and Cross-Task Review on Missing Data Imputation

Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring.…

Machine Learning · Statistics 2026-05-12 Jicong Fan

Improving Missing Data Imputation with Deep Generative Models

Datasets with missing values are very common in industry applications, and they can have a negative impact on machine learning models. Recent studies introduced solutions to the problem of imputing missing values based on deep generative…

Machine Learning · Computer Science 2019-02-28 Ramiro D. Camino, Christian A. Hammerschmidt, Radu State

Missing Data Imputation by Reducing Mutual Information with Rectified Flows

This paper introduces a novel iterative method for missing data imputation that sequentially reduces the mutual information between data and the corresponding missingness mask. Inspired by GAN-based approaches that train generators to…

Machine Learning · Statistics 2025-11-26 Jiahao Yu, Qizhen Ying, Leyang Wang, Ziyue Jiang, Song Liu

REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines

Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires the provision of high-quality data throughout the entire life-cycle of such applications.…

Databases · Computer Science 2023-02-10 Mohamed Abdelaal, Christian Hammacher, Harald Schoening

Learning to Remove Cuts in Integer Linear Programming

Cutting plane methods are a fundamental approach for solving integer linear programs (ILPs). In each iteration of such methods, additional linear constraints (cuts) are introduced to the constraint set with the aim of excluding the previous…

Optimization and Control · Mathematics 2024-06-28 Pol Puigdemont, Stratis Skoulakis, Grigorios Chrysos, Volkan Cevher

MAIN: Multihead-Attention Imputation Networks

The problem of missing data, usually absent in curated and competition-standard datasets, is an unfortunate reality for most machine learning models used in industry applications. Recent work has focused on understanding the nature and the…

Machine Learning · Computer Science 2022-01-25 Spyridon Mouselinos, Kyriakos Polymenakos, Antonis Nikitakis, Konstantinos Kyriakopoulos

Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets

Missing data is a common concern in health datasets, and its impact on good decision-making processes is well documented. Our study's contribution is a methodology for tackling missing data problems using a combination of synthetic dataset…

Machine Learning · Computer Science 2022-11-08 Gift Khangamwa, Terence L. van Zyl, Clint J. van Alten

Machine learning with incomplete datasets using multi-objective optimization models

Machine learning techniques have been developed to learn from complete data. When missing values exist in a dataset, the incomplete data should be preprocessed separately by removing data points with missing values or imputation. In this…

Machine Learning · Computer Science 2020-12-25 Hadi A. Khorshidi, Michael Kirley, Uwe Aickelin

Dealing with missing data using attention and latent space regularization

Most practical data science problems encounter missing data. A wide variety of solutions exist, each with strengths and weaknesses that depend upon the missingness-generating process. Here we develop a theoretical framework for training and…

Machine Learning · Computer Science 2022-11-15 Jahan C. Penny-Dimri, Christoph Bergmeir, Julian Smith

Design of Experiments with Imputable Feature Data: An Entropy-Based Approach

Tactical selection of experiments to estimate an underlying model is an innate task across various fields. Since each experiment has costs associated with it, selecting statistically significant experiments becomes necessary. Classic linear…

Optimization and Control · Mathematics 2021-03-30 Raj K. Velicheti, Amber Srivastava, Srinivasa M. Salapaka