Related papers: ActiveClean: Interactive Data Cleaning While Learn…

ActiveClean: Generating Line-Level Vulnerability Data via Active Learning

Deep learning vulnerability detection tools are increasing in popularity and have been shown to be effective. These tools rely on a large volume of high-quality training data, which is very hard to obtain. Most of the currently available…

Software Engineering · Computer Science 2023-12-05 Ashwin Kallingal Joshy, Mirza Sanjida Alam, Shaila Sharmin, Qi Li, Wei Le

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Active label cleaning for improved dataset quality under resource constraints

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to…

Computer Vision and Pattern Recognition · Computer Science 2022-04-25 Melanie Bernhardt, Daniel C. Castro, Ryutaro Tanno, Anton Schwaighofer, Kerem C. Tezcan, Miguel Monteiro, Shruthi Bannur, Matthew Lungren, Aditya Nori, Ben Glocker, Javier Alvarez-Valle, Ozan Oktay

Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale

Recent years have witnessed amazing outcomes from "Big Models" trained by "Big Data". Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the…

Databases · Computer Science 2015-12-15 Jinyang Gao, H. V. Jagadish, Beng Chin Ooi

Data Cleansing for Models Trained with SGD

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an…

Machine Learning · Statistics 2019-06-21 Satoshi Hara, Atsushi Nitanda, Takanori Maehara

A Mathematical Analysis of Learning Loss for Active Learning in Regression

Active learning continues to remain significant in the industry since it is data efficient. Not only is it cost effective on a constrained budget, continuous refinement of the model allows for early detection and resolution of failure…

Computer Vision and Pattern Recognition · Computer Science 2021-09-06 Megh Shukla, Shuaib Ahmed

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Learning Over Dirty Data Without Cleaning

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in…

Databases · Computer Science 2020-04-07 Jose Picado, John Davis, Arash Termehchy, Ga Young Lee

BoostClean: Automated Error Detection and Repair for Machine Learning

Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined with a variety of different sources, each susceptible to different types of inconsistencies, and new data streams during…

Databases · Computer Science 2017-11-07 Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Eugene Wu

Language Model-Driven Data Pruning Enables Efficient Active Learning

Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies the suitable…

Machine Learning · Computer Science 2024-10-08 Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza

Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Hard-to-Learn Data

Vulnerability detection is crucial for identifying security weaknesses in software systems. However, training effective machine learning models for this task is often constrained by the high cost and expertise required for data annotation.…

Cryptography and Security · Computer Science 2025-08-19 Xiang Lan, Tim Menzies, Bowen Xu

On the Relationship between Data Efficiency and Error for Uncertainty Sampling

While active learning offers potential cost savings, the actual data efficiency (the reduction in the amount of labeled data needed to obtain the same error rate) observed in practice is mixed. This paper poses a basic question: when is…

Machine Learning · Computer Science 2018-06-19 Stephen Mussmann, Percy Liang

Identifying Wrongly Predicted Samples: A Method for Active Learning

State-of-the-art machine learning models require access to a significant amount of annotated data in order to achieve the desired level of performance. While unlabelled data can be largely available and even abundant, the annotation process can…

Machine Learning · Computer Science 2020-10-15 Rahaf Aljundi, Nikolay Chumerin, Daniel Olmeda Reino

Active Robust Learning

In many practical applications of learning algorithms, unlabeled data is cheap and abundant whereas labeled data is expensive. Active learning algorithms were developed to achieve better performance at lower cost. Usually Representativeness…

Machine Learning · Computer Science 2016-08-26 Hossein Ghafarian, Hadi Sadoghi Yazdi

An Active Learning Approach for Reducing Annotation Cost in Skin Lesion Analysis

Automated skin lesion analysis is crucial in clinical practice, as skin cancer is among the most common human malignancies. Existing approaches with deep learning have achieved remarkable performance on this challenging task; however,…

Computer Vision and Pattern Recognition · Computer Science 2019-09-06 Xueying Shi, Qi Dou, Cheng Xue, Jing Qin, Hao Chen, Pheng-Ann Heng

DataAssist: A Machine Learning Approach to Data Cleaning and Preparation

Current automated machine learning (ML) tools are model-centric, focusing on model selection and parameter optimization. However, the majority of the time in data analysis is devoted to data cleaning and wrangling, for which limited tools…

Machine Learning · Computer Science 2023-07-18 Kartikay Goyle, Quin Xie, Vakul Goyle

Advancing Anomaly Detection in Computational Workflows with Active Learning

A computational workflow, also known as workflow, consists of tasks that are executed in a certain order to attain a specific computational campaign. Computational workflows are commonly employed in science domains, such as physics,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-13 Krishnan Raghavan, George Papadimitriou, Hongwei Jin, Anirban Mandal, Mariam Kiran, Prasanna Balaprakash, Ewa Deelman

Active Testing: Sample-Efficient Model Evaluation

We introduce a new framework for sample-efficient model evaluation that we call active testing. While approaches like active learning reduce the number of labels needed for model training, existing literature largely ignores the cost of…

Machine Learning · Statistics 2021-06-15 Jannik Kossen, Sebastian Farquhar, Yarin Gal, Tom Rainforth

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning…

Databases · Computer Science 2021-04-07 Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, Ce Zhang

Data Cleaning and Machine Learning: A Systematic Literature Review

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing…

Machine Learning · Computer Science 2024-06-03 Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh