Related papers: Data Quality Evaluation using Probability Models
In this paper, we delve into the critical aspect of dataset quality assessment in machine learning classification tasks. Leveraging a variety of nine distinct datasets, each crafted for classification tasks with varying complexity levels,…
Data quality is a key element for building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for rigorous formalization and an efficient measure of the quality from available…
Traditional data quality control methods are based on users experience or previously established business rules, and this limits performance in addition to being a very time consuming process with lower than desirable accuracy. Utilizing…
Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model…
A common assumption exists according to which machine learning models improve their performance when they have more data to learn from. In this study, the authors wished to clarify the dilemma by performing an empirical experiment utilizing…
Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many…
Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example,…
When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…
Machine Learning (ML) models are being increasingly employed for credit risk evaluation, with their effectiveness largely hinging on the quality of the input data. In this paper we investigate the impact of several data quality issues,…
Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many…
Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore,…
Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is the heterogeneity of data as quality aspects such as understandability…
Data-driven forecasts of air quality have recently achieved more accurate short-term predictions. Despite their success, most of the current data-driven solutions lack proper quantifications of model uncertainty that communicate how much to…
Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a…
Machine Translation Quality Estimation is a notoriously difficult task, which lessens its usefulness in real-world translation environments. Such scenarios can be improved if quality predictions are accompanied by a measure of uncertainty.…
Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods,…
Nowadays, people strive to improve the accuracy of deep learning models. However, very little work has focused on the quality of data sets. In fact, data quality determines model quality. Therefore, it is important for us to make research…
Machine Learning approaches are good in solving problems that have less information. In most cases, the software domain problems characterize as a process of learning that depend on the various circumstances and changes accordingly. A…
Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on…
Two indicators are classically used to evaluate the quality of rule-based classification systems: predictive accuracy, i.e. the system's ability to successfully reproduce learning data and coverage, i.e. the proportion of possible cases for…