Related papers: Statistical Dataset Evaluation: Reliability, Diffi…

Towards a context-dependent numerical data quality evaluation framework

This paper focuses on numeric data, with emphasis on distinct characteristics like varying significance, unstructured format, mass volume and real-time processing. We propose a novel, context-dependent valuation framework specifically…

Databases · Computer Science 2018-10-23 Milen S. Marev, Ernesto Compatangelo, Wamberto Vasconcelos

Data Quality for Software Vulnerability Datasets

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability…

Software Engineering · Computer Science 2023-01-16 Roland Croft, M. Ali Babar, Mehdi Kholoosi

Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces

In this paper, we delve into the critical aspect of dataset quality assessment in machine learning classification tasks. Leveraging a variety of nine distinct datasets, each crafted for classification tasks with varying complexity levels,…

Machine Learning · Computer Science 2023-06-28 Szymon Mazurek, Maciej Wielgosz

A Novel Metric for Measuring Data Quality in Classification Applications (extended version)

Data quality is a key element for building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for rigorous formalization and an efficient measure of the quality from available…

Machine Learning · Computer Science 2023-12-14 Jouseau Roxane, Salva Sébastien, Samir Chafik

How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models

Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for…

Machine Learning · Computer Science 2022-07-14 Ahmed M. Alaa, Boris van Breugel, Evgeny Saveliev, Mihaela van der Schaar

Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring

Data is expanding at an unimaginable rate, and with this development comes the responsibility of the quality of data. Data Quality refers to the relevance of the information present and helps in various operations like decision making and…

Machine Learning · Computer Science 2021-11-30 Sezal Chug, Priya Kaushal, Ponnurangam Kumaraguru, Tavpritesh Sethi

Evaluating the Success of a Data Analysis

A fundamental problem in the practice and teaching of data science is how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. Previously, we…

Other Statistics · Statistics 2019-04-29 Stephanie C. Hicks, Roger D. Peng

Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset,…

Databases · Computer Science 2024-09-17 Djibril Sarr

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location…

Software Engineering · Computer Science 2020-12-22 Michael F. Bosu, Stephen G. MacDonell

Data Quality Issues in Vulnerability Detection Datasets

Vulnerability detection is a crucial yet challenging task to identify potential weaknesses in software for cyber security. Recently, deep learning (DL) has made great progress in automating the detection process. Due to the complex…

Cryptography and Security · Computer Science 2024-10-10 Yuejun Guo, Seifeddine Bettaieb

An Automated Analysis Framework for Trajectory Datasets

Trajectory datasets of road users have become more important in recent years for the safety validation of highly automated vehicles. Several naturalistic trajectory datasets, each with more than 10,000 tracks, were released and others will…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Christoph Glasmacher, Robert Krajewski, Lutz Eckstein

A Survey on Autonomous Driving Datasets: Statistics, Annotation Quality, and a Future Outlook

Autonomous driving has rapidly developed and shown promising performance due to recent advances in hardware and deep learning techniques. High-quality datasets are fundamental for developing reliable autonomous driving algorithms. Previous…

Computer Vision and Pattern Recognition · Computer Science 2024-04-24 Mingyu Liu, Ekim Yurtsever, Jonathan Fossaert, Xingcheng Zhou, Walter Zimmer, Yuning Cui, Bare Luka Zagar, Alois C. Knoll

Data Quality in Empirical Software Engineering: A Targeted Review

Context: The utility of prediction models in empirical software engineering (ESE) is heavily reliant on the quality of the data used in building those models. Several data quality challenges such as noise, incompleteness, outliers and…

Software Engineering · Computer Science 2021-05-25 Michael Franklin Bosu, Stephen G. MacDonell

Towards Dependability Metrics for Neural Networks

Artificial neural networks (NN) are instrumental in realizing highly-automated driving functionality. An overarching challenge is to identify best safety engineering practices for NN and other learning-enabled components. In particular,…

Machine Learning · Computer Science 2018-06-11 Chih-Hong Cheng, Georg Nührenberg, Chung-Hao Huang, Harald Ruess, Hirotoshi Yasuoka

Detecting Errors in a Numerical Response via any Regression Model

Noise plagues many numerical datasets, where the recorded values in the data may fail to match the true underlying values due to reasons including: erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider…

Machine Learning · Statistics 2024-03-14 Hang Zhou, Jonas Mueller, Mayank Kumar, Jane-Ling Wang, Jing Lei

Developing a Dataset-Adaptive, Normalized Metric for Machine Learning Model Assessment: Integrating Size, Complexity, and Class Imbalance

Traditional metrics like accuracy, F1-score, and precision are frequently used to evaluate machine learning models; however, they may not be sufficient for evaluating performance on tiny, unbalanced, or high-dimensional datasets. A…

Machine Learning · Computer Science 2024-12-11 Serzhan Ossenov

Coverage Testing of Deep Learning Models using Dataset Characterization

Deep Neural Networks (DNNs), with their promising performance, are being increasingly used in safety-critical applications such as autonomous driving, cancer detection, and secure authentication. With growing importance in deep learning,…

Machine Learning · Computer Science 2019-11-19 Senthil Mani, Anush Sankaran, Srikanth Tamilselvam, Akshay Sethi

Improving Data Quality through Deep Learning and Statistical Models

Traditional data quality control methods are based on users' experience or previously established business rules, which limits performance and makes for a very time-consuming process with lower than desirable accuracy. Utilizing…

Artificial Intelligence · Computer Science 2018-10-17 Wei Dai, Kenji Yoshigoe, William Parsley

Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality,…

Machine Learning · Computer Science 2025-06-04 Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury

Fix your Models by Fixing your Datasets

The quality of the underlying training data is crucial for building performant machine learning models with wider generalizability. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…

Machine Learning · Computer Science 2021-12-16 Atindriyo Sanyal, Vikram Chatterji, Nidhi Vyas, Ben Epstein, Nikita Demir, Anthony Corletti