Related papers: A Benchmarking Framework for Model Datasets

Model Matching Challenge: Benchmarks for Ecore and BPMN Diagrams

In the last couple of years, Model Driven Engineering (MDE) gained a prominent role in the context of software engineering. In the MDE paradigm, models are considered first level artifacts which are iteratively developed by teams of…

Software Engineering · Computer Science · 2014-08-26 · Pit Pietsch, Klaus Müller, Bernhard Rumpe

Benchmark Data Repositories for Better Benchmarking

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for -- and levies criticisms at -- data and benchmarking practices…

Machine Learning · Computer Science · 2024-11-01 · Rachel Longjohn, Markelle Kelly, Sameer Singh, Padhraic Smyth

Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework

Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a…

Machine Learning · Computer Science · 2024-06-18 · Olivier Binette, Jerome P. Reiter

Enterprise Benchmarks for Large Language Model Evaluation

The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark…

Computation and Language · Computer Science · 2024-10-18 · Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, Md. Maruf Hossain, Guang-Jie Ren, Kate Soule, Yada Zhu

MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale

Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them. The complicated procedures for evaluating innovations, along with the lack of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-20 · Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-mei Hwu

PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison

The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark…

Machine Learning · Computer Science · 2017-03-03 · Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, Jason H. Moore

BenchML: an extensible pipelining framework for benchmarking representations of materials and molecules at scale

We introduce a machine-learning (ML) framework for high-throughput benchmarking of diverse representations of chemical systems against datasets of materials and molecules. The guiding principle underlying the benchmarking approach is to…

Machine Learning · Computer Science · 2021-12-07 · Carl Poelking, Felix A. Faber, Bingqing Cheng

Benchmarks as Microscopes: A Call for Model Metrology

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their…

Software Engineering · Computer Science · 2024-07-31 · Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

Large Language Model Routing with Benchmark Datasets

There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use…

Computation and Language · Computer Science · 2023-09-28 · Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin

Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering

Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency,…

Software Engineering · Computer Science · 2026-01-30 · Daniel Rodriguez-Cardenas, Xiaochang Li, Marcos Macedo, Antonio Mastropaolo, Dipin Khati, Yuan Tian, Huajie Shao, Denys Poshyvanyk

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the…

Computation and Language · Computer Science · 2024-11-01 · Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki

Assessing Project-Level Fine-Tuning of ML4SE Models

Machine Learning for Software Engineering (ML4SE) is an actively growing research area that focuses on methods that help programmers in their work. In order to apply the developed methods in practice, they need to achieve reasonable quality…

Software Engineering · Computer Science · 2022-06-08 · Egor Bogomolov, Sergey Zhuravlev, Egor Spirin, Timofey Bryksin

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform…

Machine Learning · Computer Science · 2025-06-03 · Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location…

Software Engineering · Computer Science · 2020-12-22 · Michael F. Bosu, Stephen G. MacDonell

A Survey of Optimization Modeling Meets LLMs: Progress and Future Directions

By virtue of its great utility in solving real-world problems, optimization modeling has been widely employed for optimal decision-making across various sectors, but it requires substantial expertise from operations research professionals.…

Artificial Intelligence · Computer Science · 2025-08-15 · Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, Dongxiang Zhang

Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework

The rise of large language models (LLMs) has introduced transformative potential in automated code generation, addressing a wide range of software engineering challenges. However, empirical evaluation of LLM-based code generation lacks…

Software Engineering · Computer Science · 2025-10-07 · Nathalia Nascimento, Everton Guimaraes, Paulo Alencar

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science · 2025-06-04 · Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer…

Software Engineering · Computer Science · 2025-08-19 · Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

A Case for Dataset Specific Profiling

Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets. With modern machine learning frameworks, anyone can develop and execute…

Machine Learning · Computer Science · 2022-08-09 · Seth Ockerman, John Wu, Christopher Stewart

The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models

Predictive benchmarking, the evaluation of machine learning models based on predictive performance and competitive ranking, is a central epistemic practice in machine learning research and an increasingly prominent method for scientific…

Machine Learning · Computer Science · 2025-10-28 · Timo Freiesleben, Sebastian Zezulka