Related papers: Pattern-Driven Data Cleaning

Automatic Data Repair: Are We Ready to Deploy?

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from…

Databases · Computer Science 2025-04-01 Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

On the Relative Trust between Inconsistent Data and Inaccurate Constraints

Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to…

Databases · Computer Science 2012-07-25 George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin

Automatic Weighted Matching Rectifying Rule Discovery for Data Repairing

Data repairing is a key problem in data cleaning which aims to uncover and rectify data errors. Traditional methods depend on data dependencies to check the existence of errors in data, but they fail to rectify the errors. To overcome this…

Databases · Computer Science 2019-09-24 Hiba Abu Ahmad, Hongzhi Wang

Human-Centric Data Cleaning [Vision]

Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently…

Databases · Computer Science 2018-01-03 El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref

Learning Dependency Models for Subset Repair

Inconsistent values are commonly encountered in real-world applications, which can negatively impact data analysis and decision-making. While existing research primarily focuses on identifying the smallest removal set to resolve…

Data Structures and Algorithms · Computer Science 2025-12-23 Haoda Li, Jiahui Chen, Yu Sun, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

Batchwise Probabilistic Incremental Data Cleaning

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

Enabling Automatic Repair of Source Code Vulnerabilities Using Data-Driven Methods

Users around the world rely on software-intensive systems in their day-to-day activities. These systems regularly contain bugs and security vulnerabilities. To facilitate bug fixing, data-driven models of automatic program repair use pairs…

Software Engineering · Computer Science 2022-02-08 Anastasiia Grishina

Cleaning data with Swipe

The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is…

Databases · Computer Science 2024-04-18 Toon Boeckling, Antoon Bronselaer

Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

Data inconsistency evaluating and repairing are major concerns in data quality management. As the basic computing task, optimal subset repair is not only applied for cost estimation during the progress of database repairing, but also…

Databases · Computer Science 2020-01-14 Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xianmin Liu

Learning Over Dirty Data with Minimal Repairs

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML…

Machine Learning · Computer Science 2026-03-19 Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil

A logic-based framework for database repairs

We introduce a general abstract framework for database repairs, where the repair notions are defined using formal logic. We distinguish between integrity constraints and so-called query constraints. The former are used to model consistency…

Databases · Computer Science 2025-03-31 Nicolas Fröhlich, Arne Meier, Nina Pardal, Jonni Virtema

Discovery of Paradigm Dependencies

Missing and incorrect values often cause serious consequences. To deal with these data quality problems, a commonly employed class of tools is dependency rules, such as Functional Dependencies (FDs), Conditional Functional Dependencies…

Databases · Computer Science 2017-10-10 Jizhou Sun, Jianzhong Li, Hong Gao

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing (Technical Report)

Errors are prevalent in time series data, such as GPS trajectories or sensor readings. Existing methods focus more on anomaly detection but not on repairing the detected anomalies. By simply filtering out the dirty data via anomaly…

Databases · Computer Science 2020-03-30 Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu

Unambiguous Prioritized Repairing of Databases

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way". Often, repairs are not equally legitimate, as it is desired to prefer one over another;…

Databases · Computer Science 2016-03-08 Benny Kimelfeld, Ester Livshits, Liat Peterfreund

A Principled Approach to Failure Analysis and Model Repairment: Demonstration in Medical Imaging

Machine learning models commonly exhibit unexpected failures post-deployment due to either data shifts or uncommon situations in the training environment. Domain experts typically go through the tedious process of inspecting the failure…

Machine Learning · Computer Science 2021-09-28 Thomas Henn, Yasukazu Sakamoto, Clément Jacquet, Shunsuke Yoshizawa, Masamichi Andou, Stephen Tchen, Ryosuke Saga, Hiroyuki Ishihara, Katsuhiko Shimizu, Yingzhen Li, Ryutaro Tanno

An Effective Data-Driven Approach for Localizing Deep Learning Faults

Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered.…

Software Engineering · Computer Science 2023-07-19 Mohammad Wardat, Breno Dantas Cruz, Wei Le, Hridesh Rajan

The Human Factor in Data Cleaning: Exploring Preferences and Biases

Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and…

Databases · Computer Science 2026-03-26 Hazim AbdElazim, Shadman Islam, Mostafa Milani

Learning Over Dirty Data Without Cleaning

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in…

Databases · Computer Science 2020-04-07 Jose Picado, John Davis, Arash Termehchy, Ga Young Lee

Explainable Data Imputation using Constraints

Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can create bias and affect the inferences. Several analysis methods, such as principal components analysis or…

Artificial Intelligence · Computer Science 2022-05-11 Sandeep Hans, Diptikalyan Saha, Aniya Aggarwal