Related papers: Automatic String Data Validation with Pattern Disc…

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed, and ML models need to be…

Databases · Computer Science 2021-04-14 Jie Song, Yeye He

Challenges and Solutions to Build a Data Pipeline to Identify Anomalies in Enterprise System Performance

We discuss how VMware is solving the following challenges to harness data to operate our ML-based anomaly detection system to detect performance issues in our Software Defined Data Center (SDDC) enterprise deployments: (i) label scarcity…

Machine Learning · Computer Science 2021-12-17 Xiaobo Huang, Amitabha Banerjee, Chien-Chia Chen, Chengzhi Huang, Tzu Yi Chuang, Abhishek Srivastava, Razvan Cheveresan

SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines

Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague…

Databases · Computer Science 2024-04-02 Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu

On the experiences of adopting automated data validation in an industrial machine learning project

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and avoid training ML…

Software Engineering · Computer Science 2021-03-09 Lucy Ellen Lwakatare, Ellinor Rånge, Ivica Crnkovic, Jan Bosch

Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings…

Databases · Computer Science 2023-06-06 Dezhan Tu, Yeye He, Weiwei Cui, Song Ge, Haidong Zhang, Han Shi, Dongmei Zhang, Surajit Chaudhuri

Moving Fast With Broken Data

Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually-growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features;…

Databases · Computer Science 2023-03-13 Shreya Shankar, Labib Fawaz, Karl Gyllstrom, Aditya G. Parameswaran

Supervised Anomaly Detection in Uncertain Pseudoperiodic Data Streams

Uncertain data streams have been widely generated in many Web applications. The uncertainty in data streams makes anomaly detection from sensor data streams far more challenging. In this paper, we present a novel framework that supports…

Artificial Intelligence · Computer Science 2016-07-21 Jiangang Ma, Le Sun, Hua Wang, Yanchun Zhang, Uwe Aickelin

Automated Multi-Source Debugging and Natural Language Error Explanation for Dashboard Applications

Modern web dashboards and enterprise applications increasingly rely on complex, distributed microservices architectures. While these architectures offer scalability, they introduce significant challenges in debugging and observability. When…

Software Engineering · Computer Science 2026-02-18 Devendra Tata, Mona Rajhans

An Optimized Pattern Recognition Algorithm for Anomaly Detection in IoT Environment

With the advent of large-scale heterogeneous search engines comes the problem of unified search control, resulting in mismatches that could otherwise have been avoided. A mechanism is needed to determine exact patterns in web mining and…

Cryptography and Security · Computer Science 2019-01-28 Nazim Uddin Sheikh, Hasina Rahman, Hamid Al-Qahtani

Deep Learning Approach to Anomaly Detection in Enterprise ETL Processes with Autoencoders

An anomaly detection method based on deep autoencoders is proposed to address anomalies that often occur in enterprise-level ETL data streams. The study first analyzes multiple types of anomalies in ETL processes, including delays, missing…

Machine Learning · Computer Science 2025-11-04 Xin Chen, Saili Uday Gadgil, Kangning Gao, Yi Hu, Cong Nie

Enabling Automatic Repair of Source Code Vulnerabilities Using Data-Driven Methods

Users around the world rely on software-intensive systems in their day-to-day activities. These systems regularly contain bugs and security vulnerabilities. To facilitate bug fixing, data-driven models of automatic program repair use pairs…

Software Engineering · Computer Science 2022-02-08 Anastasiia Grishina

Recognizing Variables from their Data via Deep Embeddings of Distributions

A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be…

Machine Learning · Computer Science 2019-09-12 Jonas Mueller, Alex Smola

SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns

The emergence of database-as-a-service platforms has made deploying database applications easier than before. Now, developers can quickly create scalable applications. However, designing performant, maintainable, and accurate applications…

Databases · Computer Science 2020-04-23 Visweswara Sai Prashanth Dintyala, Arpit Narechania, Joy Arulraj

A Generic Approach to Detect Design Patterns in Model Transformations Using a String-Matching Algorithm

Maintaining software artifacts is among the hardest tasks an engineer faces. Like any other piece of code, model transformations developed by engineers are also subject to maintenance. To facilitate the comprehension of programs, software…

Software Engineering · Computer Science 2020-10-13 Chihab eddine Mokaddem, Houari Sahraoui, Eugene Syriani

Fix your Models by Fixing your Datasets

The quality of underlying training data is crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…

Machine Learning · Computer Science 2021-12-16 Atindriyo Sanyal, Vikram Chatterji, Nidhi Vyas, Ben Epstein, Nikita Demir, Anthony Corletti

AI Total: Analyzing Security ML Models with Imperfect Data in Production

Development of new machine learning models is typically done on manually curated data sets, making them unsuitable for evaluating the models' performance during operations, where the evaluation needs to be performed automatically on…

Machine Learning · Computer Science 2021-10-15 Awalin Sopan, Konstantin Berlin

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Deep Learning for Anomaly Detection in Log Data: A Survey

Automatic log file analysis enables early detection of relevant incidents such as system failures. In particular, self-learning anomaly detection techniques capture patterns in log data and subsequently report unexpected log event…

Machine Learning · Computer Science 2023-05-16 Max Landauer, Sebastian Onder, Florian Skopik, Markus Wurzenberger

Building an Automated and Self-Aware Anomaly Detection System

Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major…

Machine Learning · Computer Science 2020-11-11 Sayan Chakraborty, Smit Shah, Kiumars Soltani, Anna Swigart, Luyao Yang, Kyle Buckingham

Automated Data Slicing for Model Validation: A Big data - AI Integration Approach

As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all…

Databases · Computer Science 2019-01-08 Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, Steven Euijong Whang