Related papers: Batchwise Probabilistic Incremental Data Cleaning

HoloClean: Holistic Data Repairs with Probabilistic Inference

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with…

Databases · Computer Science 2017-02-06 Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré

Human-Centric Data Cleaning [Vision]

Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently…

Databases · Computer Science 2018-01-03 El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref

Automatic Data Repair: Are We Ready to Deploy?

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from…

Databases · Computer Science 2025-04-01 Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

A probabilistic database approach to autoencoder-based data cleaning

Data quality problems are a large threat in data science. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data and uses it as evidence…

Databases · Computer Science 2021-08-04 R. R. Mauritz, F. P. J. Nijweide, J. Goseling, M. van Keulen

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A…

Databases · Computer Science 2017-12-29 El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Iterative Data Curation with Theoretical Guarantees

In recent years, more and more large data sets have become available. Data accuracy, the absence of verifiable errors in data, is crucial for these large materials to enable high-quality research, downstream applications, and model…

Methodology · Statistics 2025-10-27 Väinö Yrjänäinen, Johan Jonasson, Måns Magnusson

Improving Unstructured Data Quality via Updatable Extracted Views

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie, Frank Wm. Tompa

Data Cleaning and Machine Learning: A Systematic Literature Review

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing…

Machine Learning · Computer Science 2024-06-03 Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh

A Primer on the Data Cleaning Pipeline

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this…

Databases · Computer Science 2023-07-26 Rebecca C. Steorts

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data…

Machine Learning · Computer Science 2022-12-27 Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee

Improving Data Cleaning Using Discrete Optimization

One of the most important processing steps in any analysis pipeline is handling missing data. Traditional approaches simply delete any sample or feature with missing elements. Recent imputation methods replace missing data based on assumed…

Databases · Computer Science 2024-05-03 Kenneth Smith, Sharlee Climer

Incremental Consistent Updating of Incomplete Databases

Efficient consistency maintenance of incomplete and dynamic real-life databases is a quality label for further data analysis. In prior work, we tackled the generic problem of database updating in the presence of tuple generating constraints…

Databases · Computer Science 2024-05-16 Jacques Chabin, Mirian Halfeld Ferrari, Nicolas Hiot, Dominique Laurent

A Comprehensive Study of Class Incremental Learning Algorithms for Visual Tasks

The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural…

Machine Learning · Computer Science 2020-12-16 Eden Belouadah, Adrian Popescu, Ioannis Kanellos

Towards "all-inclusive" Data Preparation to ensure Data Quality

Data preparation, especially data cleaning, is very important to ensure data quality and to improve the output of automated decision systems. Since there is no single tool that covers all steps required, a combination of tools -- namely a…

Databases · Computer Science 2023-08-29 Valerie Restat

A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more…

Machine Learning · Computer Science 2019-08-13 Yuji Roh, Geon Heo, Steven Euijong Whang

BClean: A Bayesian Data Cleaning System

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of the prevalent approaches is probabilistic methods, including Bayesian…

Artificial Intelligence · Computer Science 2023-11-14 Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai Miao, Rui Mao, Makoto Onizuka, Chuan Xiao

Learning Over Dirty Data with Minimal Repairs

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML…

Machine Learning · Computer Science 2026-03-19 Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil