Related papers: Evaluation Evaluation a Monte Carlo study

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation

Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the…

Machine Learning · Computer Science 2020-11-02 David M. W. Powers

Unbiased Comparative Evaluation of Ranking Functions

Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling…

Information Retrieval · Computer Science 2016-04-26 Tobias Schnabel, Adith Swaminathan, Peter Frazier, Thorsten Joachims

On Sampling-Based Training Criteria for Neural Language Modeling

As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the…

Computation and Language · Computer Science 2021-06-18 Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For…

Machine Learning · Computer Science 2024-07-03 Juri Opitz

On the Interaction of Belief Bias and Explanations

A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction…

Computation and Language · Computer Science 2021-06-30 Ana Valeria Gonzalez, Anna Rogers, Anders Søgaard

Evaluation Measures for Relevance and Credibility in Ranked Lists

Recent discussions on alternative facts, fake news, and post truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented…

Information Retrieval · Computer Science 2017-08-25 Christina Lioma, Jakob Grue Simonsen, Birger Larsen

Robust Output Analysis with Monte-Carlo Methodology

In predictive modeling with simulation or machine learning, it is critical to accurately assess the quality of estimated values through output analysis. In recent decades output analysis has become enriched with methods that quantify the…

Methodology · Statistics 2023-10-27 Kimia Vahdat, Sara Shashaani

Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Monte Carlo methods, Variational Inference, and their combinations play a pivotal role in sampling from intractable probability distributions. However, current studies lack a unified evaluation framework, relying on disparate performance…

Machine Learning · Computer Science 2024-06-12 Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, Gerhard Neumann

Evaluation for Change

Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue evaluation should be more than that: it is a force for driving change, carrying a sociological and political…

Computation and Language · Computer Science 2022-12-23 Rishi Bommasani

Monte Carlo Estimation for Imprecise Probabilities: Basic Properties

We describe Monte Carlo methods for estimating lower envelopes of expectations of real random variables. We prove that the estimation bias is negative and that its absolute value shrinks with increasing sample size. We discuss fairly…

Probability · Mathematics 2019-09-02 Arne Decadt, Gert de Cooman, Jasper De Bock

What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes

The F-measure or F-score is one of the most commonly used single number measures in Information Retrieval, Natural Language Processing and Machine Learning, but it is based on a mistake, and the flawed assumptions render it unsuitable for…

Information Retrieval · Computer Science 2019-09-13 David M. W. Powers

Conservative Hypothesis Tests and Confidence Intervals using Importance Sampling

Importance sampling is a common technique for Monte Carlo approximation, including Monte Carlo approximation of p-values. Here it is shown that a simple correction of the usual importance sampling p-values creates valid p-values, meaning…

Computation · Statistics 2011-04-12 Matthew T. Harrison

Generalized Measures of Anticipation and Responsivity in Online Language Processing

We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a…

Computation and Language · Computer Science 2024-10-15 Mario Giulianelli, Andreas Opedal, Ryan Cotterell

Proof of principle for a self-governing prediction and forecasting reward algorithm

We use Monte Carlo techniques to simulate an organized prediction competition between a group of scientific experts acting under the influence of a "self-governing" prediction reward algorithm. Our aim is to illustrate the advantages of…

Social and Information Networks · Computer Science 2023-05-09 J. O. Gonzalez-Hernandez, Jonathan Marino, Ted Rogers, Brandon Velasco

Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients…

Computation and Language · Computer Science 2025-01-28 Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

Reasoning under Uncertainty: Some Monte Carlo Results

A series of Monte Carlo studies was performed to compare the behavior of some alternative procedures for reasoning under uncertainty. The behavior of several Bayesian, linear model and default reasoning procedures was examined in the…

Artificial Intelligence · Computer Science 2013-03-26 Paul E. Lehner , Azar Sadigh

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the…

Computation and Language · Computer Science 2025-06-06 Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

Mathematical models of confirmation bias

Confirmation bias is a cognitive bias that adversely affects management decisions, and mathematical modelling is an aid to its detailed understanding. Bias in opinion update about the value of a parameter is modelled here assuming that…

Other Statistics · Statistics 2022-02-08 Rose D Baker

Measuring Sample Quality with Stein's Method

To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to…

Machine Learning · Statistics 2019-01-03 Jackson Gorham, Lester Mackey

Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values

With origins in game theory, probabilistic values like Shapley values, Banzhaf values, and semi-values have emerged as a central tool in explainable AI. They are used for feature attribution, data attribution, data valuation, and more.…

Machine Learning · Computer Science 2026-01-14 R. Teal Witter, Yurong Liu, Christopher Musco