Related papers: Specificity measures and reference
Efficiency criteria for conformal prediction, such as \emph{observed fuzziness} (i.e., the sum of p-values associated with false labels), are commonly used to \emph{evaluate} the performance of given conformal predictors. Here, we…
The selection of the best classification algorithm for a given dataset is a very widespread problem. It is also a complex one, in the sense it requires to make several important methodological choices. Among them, in this work we focus on…
We consider generation and comprehension of natural language referring expression for objects in an image. Unlike generic "image captioning" which lacks natural standard evaluation criteria, quality of a referring expression may be measured…
Recent discussion of the success of feature selection methods has argued that focusing on a relatively small number of features has been counterproductive. Instead, it is suggested, the number of significant features can be in the thousands…
Classical machine learning classifiers tend to be overconfident can be unreliable outside of the laboratory benchmarks. Properly assessing the reliability of the output of the model per sample is instrumental for real-life scenarios where…
In practice, a ranking of objects with respect to given set of criteria is of considerable importance. However, due to lack of knowledge, information of time pressure, decision makers might not be able to provide a (crisp) ranking of…
Though it has been recognized that recommending serendipitous (i.e., surprising and relevant) items can be helpful for increasing users' satisfaction and behavioral intention, how to measure serendipity in the offline environment is still…
Characteristics of a benchmarking setup clearly can have some impact on the benchmark outcome. In this paper, we explore two methodologies to quantify the impact of the specific properties on the benchmarking outcome. Our first methodology…
Reasoning with fuzzy sets can be achieved through measures such as similarity and distance. However, these measures can often give misleading results when considered independently, for example giving the same value for two different pairs…
Do word embeddings converge to learn similar things over different initializations? How repeatable are experiments with word embeddings? Are all word embedding techniques equally reliable? In this paper we propose evaluating methods for…
Measures of accuracy usually score how accurate a specified credence depending on whether the proposition is true or false. A key requirement for such measures is strict propriety; that probabilities expect themselves to be most accurate.…
Reference analysis produces objective Bayesian inference, in the sense that inferential statements depend only on the assumed model and the available data, and the prior distribution used to make an inference is least informative in a…
User's intentions may be expressed through spontaneous gesturing, which have been seen only a few times or never before. Recognizing such gestures involves one shot gesture learning. While most research has focused on the recognition of the…
Previous studies have used a specific success metric within an algorithmic search framework to prove machine learning impossibility results. However, this specific success metric prevents us from applying these results on other forms of…
Several performance measures can be used for evaluating classification results: accuracy, F-measure, and many others. Can we say that some of them are better than others, or, ideally, choose one measure that is best in all situations? To…
This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success.…
Estimating the effort and quality of a system is a critical step at the beginning of every software project. It is necessary to have reliable ways of calculating these measures, and, it is even better when the calculation can be done as…
This article explores the extension of well-known F1 score used for assessing the performance of binary classifiers. We propose the new metric using probabilistic interpretation of precision, recall, specificity, and negative predictive…
In this paper a first attempt at deriving an improved performance measure for language models, the probability ratio measure (PRM) is described. In a proof of concept experiment, it is shown that PRM correlates better with recognition…
A key trait of stochastic optimizers is that multiple runs of the same optimizer in attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the…