Related papers: Bayesian Data Cleaning for Web Data

BClean: A Bayesian Data Cleaning System

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian…

Artificial Intelligence · Computer Science 2023-11-14 Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai Miao, Rui Mao, Makoto Onizuka, Chuan Xiao

BayesWipe: A Scalable Probabilistic Framework for Cleaning BigData

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute…

Databases · Computer Science 2015-07-01 Sushovan De, Yuheng Hu, Meduri Venkata Vamsikrishna, Yi Chen, Subbarao Kambhampati

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A…

Databases · Computer Science 2017-12-29 El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

A Formal Framework For Probabilistic Unclean Databases

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks.…

Databases · Computer Science 2019-01-28 Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas

A probabilistic database approach to autoencoder-based data cleaning

Data quality problems are a large threat in data science. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data and uses it as evidence…

Databases · Computer Science 2021-08-04 R. R. Mauritz, F. P. J. Nijweide, J. Goseling, M. van Keulen

Batchwise Probabilistic Incremental Data Cleaning

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

Mining CFD Rules on Big Data

Current conditional functional dependencies (CFDs) discovery algorithms always need a well-prepared training data set. This makes them difficult to be applied on large datasets which are always in low-quality. To handle the volume issue of…

Databases · Computer Science 2018-08-07 Hongzhi Wang, Mingda Li, Jiawei Zhao, Jianzhong Li, Hong Gao

Uncertainty Quantification with Generative Models

We develop a generative model-based approach to Bayesian inverse problems, such as image reconstruction from noisy and incomplete images. Our framework addresses two common challenges of Bayesian reconstructions: 1) It makes use of complex,…

Machine Learning · Statistics 2019-10-24 Vanessa Böhm, François Lanusse, Uroš Seljak

Learning Bayesian Networks from Big Data with Greedy Search: Computational Complexity and Efficient Implementation

Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables,…

Computation · Statistics 2019-10-25 Marco Scutari, Claudia Vitolo, Allan Tucker

Bayesian Structural Learning for an Improved Diagnosis of Cyber-Physical Systems

The diagnosis of cyber-physical systems aims to detect faulty behaviour, its root cause and a mitigation or even prevention policy. Therefore, diagnosis relies on a representation of the system's functional and faulty behaviour combined…

Machine Learning · Computer Science 2021-10-13 Nicolas Olivain, Philipp Tiefenbacher, Jens Kohl

Model Debiasing by Learnable Data Augmentation

Deep Neural Networks are well known for efficiently fitting training data, yet experiencing poor generalization capabilities whenever some kind of bias dominates over the actual task labels, resulting in models learning "shortcuts". In…

Machine Learning · Computer Science 2024-08-12 Pietro Morerio, Ruggero Ragonesi, Vittorio Murino

Towards Blind Data Cleaning: A Case Study in Music Source Separation

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-20 Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji

Robust Neural Processes for Noisy Data

Models that adapt their predictions based on some given contexts, also known as in-context learning, have become ubiquitous in recent years. We propose to study the behavior of such models when data is contaminated by noise. Towards this…

Machine Learning · Computer Science 2024-11-05 Chen Shapira, Dan Rosenbaum

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Bayesian uncertainty quantification for data-driven equation learning

Equation learning aims to infer differential equation models from data. While a number of studies have shown that differential equation models can be successfully identified when the data are sufficiently detailed and corrupted with…

Quantitative Methods · Quantitative Biology 2021-09-30 Simon Martina-Perez, Matthew J. Simpson, Ruth E. Baker

The Dynamic of Consensus in Deep Networks and the Identification of Noisy Labels

Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean…

Machine Learning · Computer Science 2022-10-04 Daniel Shwartz, Uri Stern, Daphna Weinshall

Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition

Webly-supervised learning has recently emerged as an alternative paradigm to traditional supervised learning based on large-scale datasets with manual annotations. The key idea is that models such as CNNs can be learned from the noisy…

Computer Vision and Pattern Recognition · Computer Science 2017-09-08 Christian Rupprecht, Ansh Kapil, Nan Liu, Lamberto Ballan, Federico Tombari

Towards Accelerated Model Training via Bayesian Data Selection

Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety…

Machine Learning · Computer Science 2023-11-08 Zhijie Deng, Peng Cui, Jun Zhu

On the Relative Trust between Inconsistent Data and Inaccurate Constraints

Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to…

Databases · Computer Science 2012-07-25 George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin