English
Related papers

Related papers: Batchwise Probabilistic Incremental Data Cleaning

200 papers

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with…

Databases · Computer Science 2017-02-06 Theodoros Rekatsinas , Xu Chu , Ihab F. Ilyas , Christopher Ré

Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently…

Databases · Computer Science 2018-01-03 El Kindi Rezig , Mourad Ouzzani , Ahmed K. Elmagarmid , Walid G. Aref

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from…

Databases · Computer Science 2025-04-01 Wei Ni , Xiaoye Miao , Xiangyu Zhao , Yangyang Wu , Jianwei Yin

Data quality problems are a large threat in data science. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data and uses it as evidence…

Databases · Computer Science 2021-08-04 R. R. Mauritz , F. P. J. Nijweide , J. Goseling , M. van Keulen

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee , Lubna Alzamil , Bakhtiyar Doskenov , Arash Termehchy

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A…

Databases · Computer Science 2017-12-29 El Kindi Rezig , Mourad Ouzzani , Walid G. Aref , Ahmed K. Elmagarmid , Ahmed R. Mahmood

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae , Yuji Roh , Young Hun Oh , Hyunsu Kim , Steven Euijong Whang

In recent years, more and more large data sets have become available. Data accuracy, the absence of verifiable errors in data, is crucial for these large materials to enable high-quality research, downstream applications, and model…

Methodology · Statistics 2025-10-27 Väinö Yrjänäinen , Johan Jonasson , Måns Magnusson

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie , Frank Wm. Tompa

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing…

Machine Learning · Computer Science 2024-06-03 Pierre-Olivier Côté , Amin Nikanjam , Nafisa Ahmed , Dmytro Humeniuk , Foutse Khomh

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this…

Databases · Computer Science 2023-07-26 Rebecca C. Steorts

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data…

Machine Learning · Computer Science 2022-12-27 Steven Euijong Whang , Yuji Roh , Hwanjun Song , Jae-Gil Lee

One of the most important processing steps in any analysis pipeline is handling missing data. Traditional approaches simply delete any sample or feature with missing elements. Recent imputation methods replace missing data based on assumed…

Databases · Computer Science 2024-05-03 Kenneth Smith , Sharlee Climer

Efficient consistency maintenance of incomplete and dynamic real-life databases is a quality label for further data analysis. In prior work, we tackled the generic problem of database updating in the presence of tuple generating constraints…

Databases · Computer Science 2024-05-16 Jacques Chabin , Mirian Halfeld Ferrari , Nicolas Hiot , Dominique Laurent

The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural…

Machine Learning · Computer Science 2020-12-16 Eden Belouadah , Adrian Popescu , Ioannis Kanellos

Data preparation, especially data cleaning, is very important to ensure data quality and to improve the output of automated decision systems. Since there is no single tool that covers all steps required, a combination of tools -- namely a…

Databases · Computer Science 2023-08-29 Valerie Restat

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more…

Machine Learning · Computer Science 2019-08-13 Yuji Roh , Geon Heo , Steven Euijong Whang

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian…

Artificial Intelligence · Computer Science 2023-11-14 Jianbin Qin , Sifan Huang , Yaoshu Wang , Jing Zhu , Yifan Zhang , Yukai Miao , Rui Mao , Makoto Onizuka , Chuan Xiao

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML…

Machine Learning · Computer Science 2026-03-19 Cheng Zhen , Prayoga , Nischal Aryal , Arash Termehchy , Garrett Biwer , Lubna Alzamil
‹ Prev 1 2 3 10 Next ›