Related papers: Data Quality Evaluation using Probability Models

Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces

In this paper, we delve into the critical aspect of dataset quality assessment in machine learning classification tasks. Leveraging a variety of nine distinct datasets, each crafted for classification tasks with varying complexity levels,…

Machine Learning · Computer Science 2023-06-28 Szymon Mazurek, Maciej Wielgosz

A Novel Metric for Measuring Data Quality in Classification Applications (extended version)

Data quality is a key element for building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for rigorous formalization and an efficient measure of the quality from available…

Machine Learning · Computer Science 2023-12-14 Jouseau Roxane, Salva Sébastien, Samir Chafik

Improving Data Quality through Deep Learning and Statistical Models

Traditional data quality control methods are based on users' experience or previously established business rules, and this limits performance in addition to being a very time-consuming process with lower than desirable accuracy. Utilizing…

Artificial Intelligence · Computer Science 2018-10-17 Wei Dai, Kenji Yoshigoe, William Parsley

A Data Quality-Driven View of MLOps

Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model…

Machine Learning · Computer Science 2021-02-17 Cedric Renggli, Luka Rimanic, Nezihe Merve Gürel, Bojan Karlaš, Wentao Wu, Ce Zhang

Quality of Data in Machine Learning

A common assumption exists according to which machine learning models improve their performance when they have more data to learn from. In this study, the authors wished to clarify the dilemma by performing an empirical experiment utilizing…

Machine Learning · Computer Science 2021-12-20 Antti Kariluoto, Arto Pärnänen, Joni Kultanen, Jukka Soininen, Pekka Abrahamsson

Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many…

Machine Learning · Computer Science 2021-01-06 Hyeongmin Cho, Sangkyun Lee

The Effects of Data Quality on Machine Learning Performance on Tabular Data

Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example,…

Databases · Computer Science 2025-05-15 Sedir Mohammed, Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, Hazar Harmouch

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry

How Data Quality Affects Machine Learning Models for Credit Risk Assessment

Machine Learning (ML) models are being increasingly employed for credit risk evaluation, with their effectiveness largely hinging on the quality of the input data. In this paper we investigate the impact of several data quality issues,…

Machine Learning · Computer Science 2025-11-18 Andrea Maurino

Quality Estimation without Human-labeled Data

Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many…

Computation and Language · Computer Science 2021-02-09 Yi-Lin Tuan, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Francisco Guzmán, Lucia Specia

Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore,…

Machine Learning · Computer Science 2025-02-20 Manal Rahal, Bestoun S. Ahmed, Gergely Szabados, Torgny Fornstedt, Jorgen Samuelsson

Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is the heterogeneity of data as quality aspects such as understandability…

Machine Learning · Computer Science 2021-11-15 Viola Wenz, Arno Kesper, Gabriele Taentzer

Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting

Data-driven forecasts of air quality have recently achieved more accurate short-term predictions. Despite their success, most of the current data-driven solutions lack proper quantifications of model uncertainty that communicate how much to…

Machine Learning · Computer Science 2021-12-07 Abdulmajid Murad, Frank Alexander Kraemer, Kerstin Bach, Gavin Taylor

What is the Value of Data? On Mathematical Methods for Data Quality Estimation

Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a…

Machine Learning · Computer Science 2020-05-13 Netanel Raviv, Siddharth Jain, Jehoshua Bruck

Exploring Prediction Uncertainty in Machine Translation Quality Estimation

Machine Translation Quality Estimation is a notoriously difficult task, which lessens its usefulness in real-world translation environments. Such scenarios can be improved if quality predictions are accompanied by a measure of uncertainty.…

Computation and Language · Computer Science 2016-07-01 Daniel Beck, Lucia Specia, Trevor Cohn

Classification of datasets with imputed missing values: does imputation quality matter?

Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods,…

Machine Learning · Computer Science 2023-12-20 Tolou Shadbahr, Michael Roberts, Jan Stanczuk, Julian Gilbey, Philip Teare, Sören Dittmer, Matthew Thorpe, Ramon Vinas Torne, Evis Sala, Pietro Lio, Mishal Patel, AIX-COVNET Collaboration, James H. F. Rudd, Tuomas Mirtti, Antti Rannikko, John A. D. Aston, Jing Tang, Carola-Bibiane Schönlieb

From Data Quality to Model Quality: an Exploratory Study on Deep Learning

Nowadays, people strive to improve the accuracy of deep learning models. However, very little work has focused on the quality of data sets. In fact, data quality determines model quality. Therefore, it is important for us to make research…

Machine Learning · Computer Science 2019-07-01 Tianxing He, Shengcheng Yu, Ziyuan Wang, Jieqiong Li, Zhenyu Chen

Benchmarking Machine Learning Technologies for Software Defect Detection

Machine Learning approaches are good at solving problems that have less information. In most cases, software domain problems can be characterized as a process of learning that depends on the various circumstances and changes accordingly. A…

Software Engineering · Computer Science 2015-06-26 Saiqa Aleem, Luiz Fernando Capretz, Faheem Ahmed

Impact of Data Pruning on Machine Learning Algorithm Performance

Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on…

Machine Learning · Computer Science 2019-01-31 Arun Thundyill Saseendran, Lovish Setia, Viren Chhabria, Debrup Chakraborty, Aneek Barman Roy

On Evaluating the Quality of Rule-Based Classification Systems

Two indicators are classically used to evaluate the quality of rule-based classification systems: predictive accuracy, i.e. the system's ability to successfully reproduce learning data and coverage, i.e. the proportion of possible cases for…

Artificial Intelligence · Computer Science 2020-04-07 Nassim Dehouche