Related papers: A probabilistic database approach to autoencoder-b…

Batchwise Probabilistic Incremental Data Cleaning

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

HoloClean: Holistic Data Repairs with Probabilistic Inference

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with…

Databases · Computer Science 2017-02-06 Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré

Autoencoder-based Attribute Noise Handling Method for Medical Data

Medical datasets are particularly subject to attribute noise, that is, missing and erroneous values. Attribute noise is known to be largely detrimental to learning performances. To maximize future learning performances it is primordial to…

Machine Learning · Computer Science 2022-06-23 Thomas Ranvier, Haytham Elgazel, Emmanuel Coquery, Khalid Benabdeslem

Intrinsic Self-Supervision for Data Quality Audits

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a…

Computer Vision and Pattern Recognition · Computer Science 2024-10-30 Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Labelling Consortium, Matthew Groh, Alexander A. Navarini, Marc Pouly

BayesWipe: A Scalable Probabilistic Framework for Cleaning BigData

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute…

Databases · Computer Science 2015-07-01 Sushovan De, Yuheng Hu, Meduri Venkata Vamsikrishna, Yi Chen, Subbarao Kambhampati

Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data

We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors of clean values, allowing the exploration of the manifold of possible…

Machine Learning · Computer Science 2020-07-01 Francesco Tonolini, Pablo G. Moreno, Andreas Damianou, Roderick Murray-Smith

A Formal Framework For Probabilistic Unclean Databases

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks.…

Databases · Computer Science 2019-01-28 Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Re, Theodoros Rekatsinas

Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems

Dealing with missing data in data analysis is inevitable. Although powerful imputation methods that address this problem exist, there is still much room for improvement. In this study, we examined single imputation based on deep…

Machine Learning · Computer Science 2020-04-07 Najmeh Abiri, Björn Linse, Patrik Edén, Mattias Ohlsson

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Bayesian Data Cleaning for Web Data

Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies…

Databases · Computer Science 2012-04-18 Yuheng Hu, Sushovan De, Yi Chen, Subbarao Kambhampati

The Consistency of Probabilistic Databases with Independent Cells

A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the…

Databases · Computer Science 2022-12-26 Amir Gilad, Aviram Imber, Benny Kimelfeld

Data Quality Evaluation using Probability Models

This paper discusses an approach with machine-learning probability models to evaluate the difference between good and bad data quality in a dataset. A decision tree algorithm is used to predict data quality based on no domain knowledge of…

Machine Learning · Computer Science 2020-09-16 Allen ONeill

An epistemic approach to model uncertainty in data-graphs

Graph databases are becoming widely successful as data models that can effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from…

Databases · Computer Science 2023-07-14 Sergio Abriola, Santiago Cifuentes, María Vanina Martínez, Nina Pardal, Edwin Pin

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming

Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to…

Machine Learning · Computer Science 2022-11-22 Alexander K. Lew, Monica Agrawal, David Sontag, Vikash K. Mansinghka

Towards Blind Data Cleaning: A Case Study in Music Source Separation

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-20 Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji

Improving Unstructured Data Quality via Updatable Extracted Views

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie, Frank Wm. Tompa

Active label cleaning for improved dataset quality under resource constraints

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to…

Computer Vision and Pattern Recognition · Computer Science 2022-04-25 Melanie Bernhardt, Daniel C. Castro, Ryutaro Tanno, Anton Schwaighofer, Kerem C. Tezcan, Miguel Monteiro, Shruthi Bannur, Matthew Lungren, Aditya Nori, Ben Glocker, Javier Alvarez-Valle, Ozan Oktay

Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

Databases · Computer Science 2025-04-16 Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri

BClean: A Bayesian Data Cleaning System

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of the prevalent approaches is probabilistic methods, including Bayesian…

Artificial Intelligence · Computer Science 2023-11-14 Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai Miao, Rui Mao, Makoto Onizuka, Chuan Xiao

Representation-Based Data Quality Audits for Audio

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework,…

Sound · Computer Science 2025-10-01 Alvaro Gonzalez-Jimenez, Fabian Gröger, Linda Wermelinger, Andrin Bürli, Iason Kastanis, Simone Lionetti, Marc Pouly