Detecting Data Errors with Statistical Constraints

Jing Nathan Yan; Oliver Schulte; Jiannan Wang; Reynold Cheng

Databases 2019-02-27 v1

Abstract

A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user's domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches.
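
To make task (1) concrete, the following is a minimal sketch of how an independence SC between two categorical columns might be checked. It uses Pearson's chi-squared test of independence, which is a standard statistical test chosen here for illustration; the abstract does not say which test CODED uses. The function names `chi_square_statistic` and `violates_independence`, and the caller-supplied `critical_value` parameter, are hypothetical.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Return (statistic, degrees_of_freedom) for Pearson's chi-squared
    test of independence between two equally long categorical columns.

    Illustrative only: not taken from the CODED system.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of records")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed count of each (x, y) pair
    x_counts = Counter(xs)         # marginal counts of the first column
    y_counts = Counter(ys)         # marginal counts of the second column

    stat = 0.0
    for x, cx in x_counts.items():
        for y, cy in y_counts.items():
            expected = cx * cy / n             # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected

    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return stat, dof


def violates_independence(xs, ys, critical_value):
    """Treat the independence SC as violated when the statistic exceeds
    a critical value chosen by the caller for the desired significance
    level and the degrees of freedom (hypothetical decision rule)."""
    stat, _ = chi_square_statistic(xs, ys)
    return stat > critical_value


# Two columns that always agree are strongly dependent.
dependent_a = ["a", "a", "b", "b"] * 5
dependent_b = ["a", "a", "b", "b"] * 5
# Two columns whose value pairs are perfectly balanced look independent.
balanced_a = ["a", "a", "b", "b"] * 5
balanced_b = ["u", "v", "u", "v"] * 5

# 3.841 is the chi-squared critical value for 1 degree of freedom at the 0.05 level.
print(violates_independence(dependent_a, dependent_b, 3.841))
print(violates_independence(balanced_a, balanced_b, 3.841))
```

A dependence SC could be checked with the same statistic by inverting the decision, and the per-cell terms of the sum give one rough way to see which value combinations drive a violation; both are assumptions of this sketch rather than claims about the paper's method.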

Keywords

program analysis, verification, anomaly detection

Cite

@article{arxiv.1902.09711,
  title  = {Detecting Data Errors with Statistical Constraints},
  author = {Jing Nathan Yan and Oliver Schulte and Jiannan Wang and Reynold Cheng},
  journal= {arXiv preprint arXiv:1902.09711},
  year   = {2019}
}
