Related papers: Detecting Data Errors with Statistical Constraints

Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

Databases · Computer Science 2025-04-16 Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri

SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-03-19 Bowen Yuan, Yuxia Fu, Zijian Wang, Yadan Luo, Zi Huang

Towards Robust Federated Analytics via Differentially Private Measurements of Statistical Heterogeneity

Statistical heterogeneity is a measure of how skewed the samples of a dataset are. It is a common problem in the study of differential privacy that the usage of a statistically heterogeneous dataset results in a significant loss of…

Machine Learning · Computer Science 2024-12-02 Mary Scott, Graham Cormode, Carsten Maple

Causality on Cross-Sectional Data: Stable Specification Search in Constrained Structural Equation Modeling

Causal modeling has long been an attractive topic for many researchers, and recent decades have seen a surge in theoretical development and discovery algorithms. Generally discovery algorithms can be divided into two approaches:…

Machine Learning · Statistics 2017-02-06 Ridho Rahmadi, Perry Groot, Marianne Heins, Hans Knoop, Tom Heskes

A Flexible System for Automatic Quality Control of Oceanographic Data

Sampling errors are inevitable when measuring the ocean; thus, achieving a trustworthy set of observations requires a quality control (QC) procedure capable of detecting spurious data. While manual QC by human experts minimizes errors, it is…

Atmospheric and Oceanic Physics · Physics 2021-05-04 Guilherme P. Castelão

Conformal Segmentation in Industrial Surface Defect Detection with Statistical Guarantees

In industrial settings, surface defects on steel can significantly compromise its service life and elevate potential safety risks. Traditional defect detection methods predominantly rely on manual inspection, which suffers from low…

Machine Learning · Computer Science 2025-04-25 Cheng Shen, Yuewei Liu

SELECT: Detecting Label Errors in Real-world Scene Text Data

We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li

Comparing Shape-Constrained Regression Algorithms for Data Validation

Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible. Therefore, we require automated data validation approaches that are able to consider the prior knowledge of domain…

Machine Learning · Computer Science 2023-03-10 Florian Bachinger, Gabriel Kronberger

Sensitive Information Detection: Recursive Neural Networks for Encoding Context

The amount of data for processing and categorization grows at an ever increasing rate. At the same time the demand for collaboration and transparency in organizations, government and businesses, drives the release of data from internal…

Machine Learning · Computer Science 2020-08-26 Jan Neerbek

Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets

Scientific practice typically involves repeatedly studying a system, each time trying to unravel a different perspective. In each study, the scientist may take measurements under different experimental conditions (interventions,…

Machine Learning · Statistics 2014-03-11 Sofia Triantafillou, Ioannis Tsamardinos

Shape Constraints in Symbolic Regression using Penalized Least Squares

We study the addition of shape constraints (SC) and their consideration during the parameter identification step of symbolic regression (SR). SC serve as a means to introduce prior knowledge about the shape of the otherwise unknown model…

Machine Learning · Computer Science 2024-08-07 Viktor Martinek, Julia Reuter, Ophelia Frotscher, Sanaz Mostaghim, Markus Richter, Roland Herzog

Explainable Data Imputation using Constraints

Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can create bias and affect the inferences. Several analysis methods, such as principal component analysis or…

Artificial Intelligence · Computer Science 2022-05-11 Sandeep Hans, Diptikalyan Saha, Aniya Aggarwal

Sequential Correct Screening and Post-Screening Inference

Selecting the top-$m$ variables with the $m$ largest population parameters from a larger set of candidates is a fundamental problem in statistics. In this paper, we propose a novel methodology called Sequential Correct Screening (SCS),…

Methodology · Statistics 2025-08-21 Masaki Toyoda, Yoshimasa Uematsu

Data Consistency Approach to Model Validation

In scientific inference problems, the underlying statistical modeling assumptions have a crucial impact on the end results. There exist, however, only a few automatic means for validating these fundamental modelling assumptions. The…

Methodology · Statistics 2019-05-21 Andreas Svensson, Dave Zachariah, Petre Stoica, Thomas B. Schön

Differentiable Constraint-Based Causal Discovery

Causal discovery from observational data is a fundamental task in artificial intelligence, with far-reaching implications for decision-making, predictions, and interventions. Despite significant advances, existing methods can be broadly…

Machine Learning · Computer Science 2026-02-06 Jincheng Zhou, Mengbo Wang, Anqi He, Yumeng Zhou, Hessam Olya, Murat Kocaoglu, Bruno Ribeiro

SECODA: Segmentation- and Combination-Based Detection of Anomalies

This study introduces SECODA, a novel general-purpose unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes. The method is guaranteed to identify cases with unique or sparse…

Databases · Computer Science 2020-08-18 Ralph Foorthuis

Detecting Errors in a Numerical Response via any Regression Model

Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider…

Machine Learning · Statistics 2024-03-14 Hang Zhou, Jonas Mueller, Mayank Kumar, Jane-Ling Wang, Jing Lei

SCADE: Scalable Framework for Anomaly Detection in High-Performance System

As command-line interfaces remain integral to high-performance computing environments, the risk of exploitation through stealthy and complex command-line abuse grows. Conventional security solutions struggle to detect these anomalies due to…

Cryptography and Security · Computer Science 2024-12-10 Vaishali Vinay, Anjali Mangal

CSED: A Chinese Semantic Error Diagnosis Corpus

Recently, much Chinese text error correction work has focused on Chinese Spelling Check (CSC) and Chinese Grammatical Error Diagnosis (CGED). In contrast, little attention has been paid to the complicated problem of Chinese Semantic Error…

Computation and Language · Computer Science 2023-05-10 Bo Sun, Baoxin Wang, Yixuan Wang, Wanxiang Che, Dayong Wu, Shijin Wang, Ting Liu

ED2: Two-stage Active Learning for Error Detection -- Technical Report

Traditional error detection approaches require user-defined parameters and rules. Thus, the user has to know both the error detection system and the data. However, we can also formulate error detection as a semi-supervised classification…

Machine Learning · Computer Science 2019-08-20 Felix Neutatz, Mohammad Mahdavi, Ziawasch Abedjan