
Related papers: A probabilistic database approach to autoencoder-b…

200 papers

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira , Daniel S. Kaster , Caetano Traina-Jr. , Ihab F. Ilyas

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with…

Databases · Computer Science 2017-02-06 Theodoros Rekatsinas , Xu Chu , Ihab F. Ilyas , Christopher Ré

Medical datasets are particularly subject to attribute noise, that is, missing and erroneous values. Attribute noise is known to be largely detrimental to learning performances. To maximize future learning performances it is primordial to…

Machine Learning · Computer Science 2022-06-23 Thomas Ranvier , Haytham Elgazel , Emmanuel Coquery , Khalid Benabdeslem

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a…

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute…

Databases · Computer Science 2015-07-01 Sushovan De , Yuheng Hu , Meduri Venkata Vamsikrishna , Yi Chen , Subbarao Kambhampati

We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors of clean values, allowing the exploration of the manifold of possible…

Machine Learning · Computer Science 2020-07-01 Francesco Tonolini , Pablo G. Moreno , Andreas Damianou , Roderick Murray-Smith

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks.…

Databases · Computer Science 2019-01-28 Christopher De Sa , Ihab F. Ilyas , Benny Kimelfeld , Christopher Re , Theodoros Rekatsinas

Dealing with missing data in data analysis is inevitable. Although powerful imputation methods that address this problem exist, there is still much room for improvement. In this study, we examined single imputation based on deep…

Machine Learning · Computer Science 2020-04-07 Najmeh Abiri , Björn Linse , Patrik Edén , Mattias Ohlsson

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies…

Databases · Computer Science 2012-04-18 Yuheng Hu , Sushovan De , Yi Chen , Subbarao Kambhampati

A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the…

Databases · Computer Science 2022-12-26 Amir Gilad , Aviram Imber , Benny Kimelfeld

This paper discusses an approach with machine-learning probability models to evaluate the difference between good and bad data quality in a dataset. A decision tree algorithm is used to predict data quality based on no domain knowledge of…

Machine Learning · Computer Science 2020-09-16 Allen ONeill

Graph databases are becoming widely successful as data models that allow to effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from…

Databases · Computer Science 2023-07-14 Sergio Abriola , Santiago Cifuentes , María Vanina Martínez , Nina Pardal , Edwin Pin

Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to…

Machine Learning · Computer Science 2022-11-22 Alexander K. Lew , Monica Agrawal , David Sontag , Vikash K. Mansinghka

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-20 Azalea Gui , Woosung Choi , Junghyun Koo , Kazuki Shimada , Takashi Shibuya , Joan Serrà , Wei-Hsiang Liao , Yuki Mitsufuji

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie , Frank Wm. Tompa

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to…

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of the prevalent approaches is probabilistic methods, including Bayesian…

Artificial Intelligence · Computer Science 2023-11-14 Jianbin Qin , Sifan Huang , Yaoshu Wang , Jing Zhu , Yifan Zhang , Yukai Miao , Rui Mao , Makoto Onizuka , Chuan Xiao

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework,…
