Related papers: Auto-Validate: Unsupervised Data Validation Using …

Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings…

Databases · Computer Science 2023-06-06 Dezhan Tu, Yeye He, Weiwei Cui, Song Ge, Haidong Zhang, Han Shi, Dongmei Zhang, Surajit Chaudhuri

Auto-Tag: Tagging-Data-By-Example in Data Lakes

As data lakes become increasingly popular in large enterprises today, there is a growing need to tag or classify data assets (e.g., files and databases) in data lakes with additional metadata (e.g., semantic column-types), as the inferred…

Databases · Computer Science 2021-12-14 Yeye He, Jie Song, Yue Wang, Surajit Chaudhuri, Vishal Anil, Blake Lassiter, Yaron Goland, Gaurav Malhotra

Automatic String Data Validation with Pattern Discovery

In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the…

Databases · Computer Science 2024-08-07 Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin

Towards Unsupervised Validation of Anomaly-Detection Models

Unsupervised validation of anomaly-detection models is a highly challenging task. While the common practices for model validation involve a labeled validation set, such validation sets cannot be constructed when the underlying datasets are…

Machine Learning · Computer Science 2025-01-06 Lihi Idan

AI Total: Analyzing Security ML Models with Imperfect Data in Production

Development of new machine learning models is typically done on manually curated data sets, making them unsuitable for evaluating the models' performance during operations, where the evaluation needs to be performed automatically on…

Machine Learning · Computer Science 2021-10-15 Awalin Sopan, Konstantin Berlin

Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore,…

Machine Learning · Computer Science 2025-02-20 Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

Moving Fast With Broken Data

Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually-growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features;…

Databases · Computer Science 2023-03-13 Shreya Shankar, Labib Fawaz, Karl Gyllstrom, Aditya G. Parameswaran

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search

Recent work has made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). We in this work propose to automate…

Databases · Computer Science 2021-08-05 Junwen Yang, Yeye He, Surajit Chaudhuri

Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics

As modern data pipelines continue to collect, produce, and store a variety of data formats, extracting and combining value from traditional and context-rich sources such as strings, text, video, audio, and logs becomes a manual process…

Databases · Computer Science 2023-12-05 Viktor Sanca, Anastasia Ailamaki

AutoML in Cybersecurity: An Empirical Study

Automated machine learning (AutoML) has emerged as a promising paradigm for automating machine learning (ML) pipeline design, broadening AI adoption. Yet its reliability in complex domains such as cybersecurity remains underexplored. This…

Cryptography and Security · Computer Science 2025-09-30 Sherif Saad, Kevin Shi, Mohammed Mamun, Hythem Elmiligi

On the experiences of adopting automated data validation in an industrial machine learning project

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and avoid training ML…

Software Engineering · Computer Science 2021-03-09 Lucy Ellen Lwakatare, Ellinor Rånge, Ivica Crnkovic, Jan Bosch

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and…

Databases · Computer Science 2024-05-01 Stefan Grafberger, Paul Groth, Sebastian Schelter

Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans

Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with…

Databases · Computer Science 2024-09-04 Stefan Grafberger

A Framework for Cryptographic Verifiability of End-to-End AI Pipelines

The increasing integration of Artificial Intelligence across multiple industry sectors necessitates robust mechanisms for ensuring transparency, trust, and auditability of its development and deployment. This topic is particularly important…

Cryptography and Security · Computer Science 2025-03-31 Kar Balan, Robert Learney, Tim Wood

AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge

Up-to-date and reliable language models are consistently sought after and are essential in various applications. Typically, models are trained on a fixed dataset and then deployed globally. However, the knowledge of the models becomes…

Computation and Language · Computer Science 2025-02-28 Praneeth Vadlapati

Benchmark and Survey of Automated Machine Learning Frameworks

Machine learning (ML) has become a vital part in many aspects of our daily life. However, building well performing machine learning applications requires highly specialized data scientists and domain experts. Automated machine learning…

Machine Learning · Computer Science 2021-01-27 Marc-André Zöller, Marco F. Huber

AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration

The widespread adoption of big data has ushered in a new era of data-driven decision-making, transforming numerous industries and sectors. However, the efficacy of these decisions hinges on the quality of the underlying data. Poor data…

Artificial Intelligence · Computer Science 2024-05-08 Widad Elouataoui

Automated Vulnerability Validation and Verification: A Large Language Model Approach

Software vulnerabilities remain a critical security challenge, providing entry points for attackers into enterprise networks. Despite advances in security practices, the lack of high-quality datasets capturing diverse exploit behavior…

Cryptography and Security · Computer Science 2025-11-17 Alireza Lotfi, Charalampos Katsis, Elisa Bertino

AutoCure: Automated Tabular Data Curation Technique for ML Pipelines

Machine learning algorithms have become increasingly prevalent in multiple domains, such as autonomous driving, healthcare, and finance. In such domains, data preparation remains a significant challenge in developing accurate models,…

Databases · Computer Science 2023-04-27 Mohamed Abdelaal, Rashmi Koparde, Harald Schoening