Related papers: Guided Data Repair

Generative Data Refinement: Just Ask for Better Data

For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of its training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web,…

Machine Learning · Computer Science 2025-09-12 Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette

Automatic Data Repair: Are We Ready to Deploy?

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from…

Databases · Computer Science 2025-04-01 Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

Data-Driven Feedback Generation for Introductory Programming Exercises

This paper introduces the "Search, Align, and Repair" data-driven program repair framework to automate feedback generation for introductory programming exercises. Distinct from existing techniques, our goal is to develop an efficient, fully…

Programming Languages · Computer Science 2017-11-21 Ke Wang, Rishabh Singh, Zhendong Su

Automatic Weighted Matching Rectifying Rule Discovery for Data Repairing

Data repairing is a key problem in data cleaning which aims to uncover and rectify data errors. Traditional methods depend on data dependencies to check the existence of errors in data, but they fail to rectify the errors. To overcome this…

Databases · Computer Science 2019-09-24 Hiba Abu Ahmad, Hongzhi Wang

Regularizing Neural Networks with Meta-Learning Generative Models

This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small…

Machine Learning · Computer Science 2023-10-24 Shin'ya Yamaguchi, Daiki Chijiwa, Sekitoshi Kanai, Atsutoshi Kumagai, Hisashi Kashima

Automated Data Quality Validation in an End-to-End GNN Framework

Ensuring data quality is crucial in modern data ecosystems, especially for training or testing datasets in machine learning. Existing validation approaches rely on computing data quality metrics and/or using expert-defined constraints.…

Databases · Computer Science 2025-02-18 Sijie Dong, Soror Sahri, Themis Palpanas, Qitong Wang

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A…

Databases · Computer Science 2017-12-29 El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

Approximate Data Deletion in Generative Models

Users have the right to have their data deleted by third-party learned systems, as codified by recent legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Such data deletion can…

Machine Learning · Computer Science 2022-06-30 Zhifeng Kong, Scott Alfeld

Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data

Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. However, in the traditional settings, all data should be simultaneously accessed in the training procedure: it does not allow…

Machine Learning · Computer Science 2025-02-04 Huawei Lin, Jun Woo Chung, Yingjie Lao, Weijie Zhao

Enabling Automatic Repair of Source Code Vulnerabilities Using Data-Driven Methods

Users around the world rely on software-intensive systems in their day-to-day activities. These systems regularly contain bugs and security vulnerabilities. To facilitate bug fixing, data-driven models of automatic program repair use pairs…

Software Engineering · Computer Science 2022-02-08 Anastasiia Grishina

Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning

In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires…

Machine Learning · Computer Science 2024-08-09 Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, Josiah P. Hanna

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Human-Centric Data Cleaning [Vision]

Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently…

Databases · Computer Science 2018-01-03 El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref

Batchwise Probabilistic Incremental Data Cleaning

Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to…

Databases · Computer Science 2020-11-11 Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

How Helpful do Novice Programmers Find the Feedback of an Automated Repair Tool?

Immediate feedback has been shown to improve student learning. In programming courses, immediate, automated feedback is typically provided in the form of pre-defined test cases run by a submission platform. While these are excellent for…

Computers and Society · Computer Science 2024-02-02 Oka Kurniawan, Christopher M. Poskitt, Ismam Al Hoque, Norman Tiong Seng Lee, Cyrille Jégourel, Nachamma Sockalingam

Hierarchical Group-wise Ranking Framework for Recommendation Models

In modern recommender systems, CTR/CVR models are increasingly trained with ranking objectives to improve item ranking quality. While this shift aligns training more closely with serving goals, most existing methods rely on in-batch…

Information Retrieval · Computer Science 2025-06-17 YaChen Yan, Liubo Li, Ravi Choudhary

Data Cleansing for Models Trained with SGD

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an…

Machine Learning · Statistics 2019-06-21 Satoshi Hara, Atsushi Nitanda, Takanori Maehara

Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing (Technical Report)

Errors are prevalent in time series data, such as GPS trajectories or sensor readings. Existing methods focus more on anomaly detection but not on repairing the detected anomalies. By simply filtering out the dirty data via anomaly…

Databases · Computer Science 2020-03-30 Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu

Trust Enhancement Issues in Program Repair

Automated program repair is an emerging technology that seeks to automatically rectify bugs and vulnerabilities using learning, search, and semantic analysis. Trust in automatically generated patches is necessary for achieving greater…

Software Engineering · Computer Science 2022-02-14 Yannic Noller, Ridwan Shariffdeen, Xiang Gao, Abhik Roychoudhury

Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement

With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. However, most research focuses on repairing the buggy code of programming learners…

Artificial Intelligence · Computer Science 2026-01-21 Zhenlong Dai, Zhuoluo Zhao, Hengning Wang, Xiu Tang, Sai Wu, Chang Yao, Zhipeng Gao, Jingyuan Chen