Related papers: Benchmarking Simulation-Based Inference

Benchmarks as Microscopes: A Call for Model Metrology

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their…

Software Engineering · Computer Science 2024-07-31 Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

Benchmark Data Repositories for Better Benchmarking

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for -- and levies criticisms at -- data and benchmarking practices…

Machine Learning · Computer Science 2024-11-01 Rachel Longjohn, Markelle Kelly, Sameer Singh, Padhraic Smyth

Optimization-based Quantification of Simulation Input Uncertainty via Empirical Likelihood

We study an optimization-based approach to construct statistically accurate confidence intervals for simulation performance measures under nonparametric input uncertainty. This approach computes confidence bounds from simulation runs driven…

Methodology · Statistics 2019-02-14 Henry Lam, Huajie Qian

Towards More Fine-grained and Reliable NLP Performance Prediction

Performance prediction, the task of estimating a system's performance without performing experiments, allows us to reduce the experimental burden caused by the combinatorial explosion of different datasets, languages, tasks, and models. In…

Computation and Language · Computer Science 2021-02-11 Zihuiwen Ye, Pengfei Liu, Jinlan Fu, Graham Neubig

Better than classical? The subtle art of benchmarking quantum machine learning models

Benchmarking models via classical simulations is one of the main ways to judge ideas in quantum machine learning before noise-free hardware is available. However, the huge impact of the experimental design on the results, the small scales…

Quantum Physics · Physics 2024-03-15 Joseph Bowles, Shahnawaz Ahmed, Maria Schuld

A Trust Crisis In Simulation-Based Inference? Your Posterior Approximations Can Be Unfaithful

We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms can produce computationally unfaithful posterior approximations. Our results show that all benchmarked algorithms -- (Sequential)…

Machine Learning · Statistics 2022-12-06 Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, Volodimir Begy, Gilles Louppe

Bayesian Inference for Randomized Benchmarking Protocols

Randomized benchmarking (RB) protocols are standard tools for characterizing quantum devices. Prior analyses of RB protocols have not provided a complete method for analyzing realistic data, resulting in a variety of ad-hoc methods. The…

Quantum Physics · Physics 2018-02-02 Ian Hincks, Joel J. Wallman, Chris Ferrie, Chris Granade, David G. Cory

Towards Game-Playing AI Benchmarks via Performance Reporting Standards

While games have been used extensively as milestones to evaluate game-playing AI, there exists no standardised framework for reporting the obtained observations. As a result, it remains difficult to draw general conclusions about the…

Artificial Intelligence · Computer Science 2020-07-07 Vanessa Volz, Boris Naujoks

Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks

Performance optimization of deep learning models is conducted either manually or through automatic architecture search, or a combination of both. On the other hand, their performance strongly depends on the target hardware and how…

Machine Learning · Computer Science 2022-09-23 Vahid Partovi Nia, Alireza Ghaffari, Mahdi Zolnouri, Yvon Savaria

A Continuous Benchmarking Infrastructure for High-Performance Computing Applications

For scientific software, especially those used for large-scale simulations, achieving good performance and efficiently using the available hardware resources is essential. It is important to regularly perform benchmarks to ensure the…

Performance · Computer Science 2024-06-12 Christoph Alt, Martin Lanser, Jonas Plewinski, Atin Janki, Axel Klawonn, Harald Köstler, Michael Selzer, Ulrich Rüde

Statistical Inference with Limited Memory: A Survey

The problem of statistical inference in its various forms has been the subject of decades-long extensive research. Most of the effort has been focused on characterizing the behavior as a function of the number of available samples, with far…

Machine Learning · Computer Science 2024-11-12 Tomer Berg, Or Ordentlich, Ofer Shayevitz

Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare

AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems…

Artificial Intelligence · Computer Science 2026-05-12 Prasanna Desikan, Harshit Rajgarhia, Shivali Dalmia, Ananya Mantravadi

TSI-Bench: Benchmarking Time Series Imputation

Effective imputation is a crucial preprocessing step for time series analysis. Despite the development of numerous deep learning algorithms for time series imputation, the community lacks standardized and comprehensive benchmark platforms…

Machine Learning · Computer Science 2024-11-01 Wenjie Du, Jun Wang, Linglong Qian, Yiyuan Yang, Zina Ibrahim, Fanxing Liu, Zepu Wang, Haoxin Liu, Zhiyuan Zhao, Yingjie Zhou, Wenjia Wang, Kaize Ding, Yuxuan Liang, B. Aditya Prakash, Qingsong Wen

Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks

Free energy calculations are rapidly becoming indispensable in structure-enabled drug discovery programs. As new methods, force fields, and implementations are developed, assessing their expected accuracy on real-world systems…

Biomolecules · Quantitative Biology 2023-03-30 David F. Hahn, Christopher I. Bayly, Hannah E. Bruce Macdonald, John D. Chodera, Vytautas Gapsys, Antonia S. J. S. Mey, David L. Mobley, Laura Perez Benito, Christina E. M. Schindler, Gary Tresadern, Gregory L. Warren

Predictive Models from Quantum Computer Benchmarks

Holistic benchmarks for quantum computers are essential for testing and summarizing the performance of quantum hardware. However, holistic benchmarks -- such as algorithmic or randomized benchmarks -- typically do not predict a processor's…

Quantum Physics · Physics 2023-05-16 Daniel Hothem, Jordan Hines, Karthik Nataraj, Robin Blume-Kohout, Timothy Proctor

Deep Neural Network Benchmarks for Selective Classification

With the increasing deployment of machine learning models in many socially sensitive tasks, there is a growing demand for reliable and trustworthy predictions. One way to accomplish these requirements is to allow a model to abstain from…

Machine Learning · Computer Science 2024-09-19 Andrea Pugnana, Lorenzo Perini, Jesse Davis, Salvatore Ruggieri

Simulation-based stacking

Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of…

Methodology · Statistics 2024-03-04 Yuling Yao, Bruno Régaldo-Saint Blancard, Justin Domke

Benchmarking Large Language Model Uncertainty for Prompt Optimization

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer,…

Machine Learning · Computer Science 2024-12-30 Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

Identifying Process Improvement Opportunities through Process Execution Benchmarking

Benchmarking functionalities in current commercial process mining tools allow organizations to contextualize their process performance through high-level performance indicators, such as completion rate or throughput time. However, they do…

Software Engineering · Computer Science 2025-04-24 Luka Abb, Majid Rafiei, Timotheus Kampik, Jana-Rebecca Rehse

Meta-Metrics and Best Practices for System-Level Inference Performance Benchmarking

Benchmarking inference performance (speed) of Foundation Models such as Large Language Models (LLM) involves navigating a vast experimental landscape to understand the complex interactions between hardware and software components. However,…

Performance · Computer Science 2025-08-15 Shweta Salaria, Zhuoran Liu, Nelson Mimura Gonzalez