Related papers: Questionable practices in machine learning

Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers

Many research fields are currently reckoning with issues of poor levels of reproducibility. Some label it a "crisis", and research employing or building Machine Learning (ML) models is no exception. Issues including lack of transparency,…

Software Engineering · Computer Science 2025-02-27 Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, Dominik Kowald

Chasing Shadows: Pitfalls in LLM Security Research

Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has…

Cryptography and Security · Computer Science 2025-12-16 Jonathan Evertz, Niklas Risse, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, Lea Schönherr

Position: Why We Must Rethink Empirical Research in Machine Learning

We warn against a common but incomplete understanding of empirical research in machine learning that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming…

Machine Learning · Computer Science 2024-05-28 Moritz Herrmann, F. Julian D. Lange, Katharina Eggensperger, Giuseppe Casalicchio, Marcel Wever, Matthias Feurer, David Rügamer, Eyke Hüllermeier, Anne-Laure Boulesteix, Bernd Bischl

Troubling Trends in Machine Learning Scholarship

Collectively, machine learning (ML) researchers are engaged in the creation and dissemination of knowledge about data-driven algorithms. In a given paper, researchers might aspire to any subset of the following goals, among others: to…

Machine Learning · Statistics 2018-07-27 Zachary C. Lipton, Jacob Steinhardt

REFORMS: Reporting Standards for Machine Learning Based Science

Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific…

Machine Learning · Computer Science 2023-09-21 Sayash Kapoor, Emily Cantrell, Kenny Peng, Thanh Hien Pham, Christopher A. Bail, Odd Erik Gundersen, Jake M. Hofman, Jessica Hullman, Michael A. Lones, Momin M. Malik, Priyanka Nanayakkara, Russell A. Poldrack, Inioluwa Deborah Raji, Michael Roberts, Matthew J. Salganik, Marta Serra-Garcia, Brandon M. Stewart, Gilles Vandewiele, Arvind Narayanan

Pitfalls and potentials in simulation studies: Questionable research practices in comparative simulation studies allow for spurious claims of superiority of any method

Comparative simulation studies are workhorse tools for benchmarking statistical methods. As with other empirical studies, the success of simulation studies hinges on the quality of their design, execution and reporting. If not conducted…

Methodology · Statistics 2023-03-10 Samuel Pawel, Lucas Kook, Kelly Reeve

Reproducibility in Machine Learning-Driven Research

Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research. Often,…

Machine Learning · Computer Science 2023-07-21 Harald Semmelrock, Simone Kopeinik, Dieter Theiler, Tony Ross-Hellauer, Dominik Kowald

Existing Large Language Model Unlearning Evaluations Are Inconclusive

Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically…

Machine Learning · Computer Science 2025-06-03 Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter

A Survey on Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on Real-World Robots

As reinforcement learning (RL) achieves more success in solving complex tasks, more care is needed to ensure that RL research is reproducible and that algorithms herein can be compared easily and fairly with minimal bias. RL results are,…

Machine Learning · Computer Science 2019-09-12 Nicolai A. Lynnerup, Laura Nolling, Rasmus Hasle, John Hallam

More Rigorous Software Engineering Would Improve Reproducibility in Machine Learning Research

While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of machine learning (ML) research are often undervalued or overlooked, leading both to poor…

Software Engineering · Computer Science 2025-09-03 Moritz Wolter, Lokesh Veeramacheneni, Charles Tapley Hoyt

Ten ways to fool the masses with machine learning

If you want to tell people the truth, make them laugh, otherwise they'll kill you. (source unclear) Machine learning and deep learning are the technologies of the day for developing intelligent automatic systems. However, a key hurdle for…

Machine Learning · Computer Science 2019-01-08 Fayyaz Minhas, Amina Asif, Asa Ben-Hur

Sources of Irreproducibility in Machine Learning: A Review

Background: Many published machine learning studies are irreproducible. Issues with methodology and not properly accounting for variation introduced by the algorithm themselves or their implementations are attributed as the main…

Machine Learning · Computer Science 2023-04-17 Odd Erik Gundersen, Kevin Coakley, Christine Kirkpatrick, Yolanda Gil

Pitfalls in Evaluating Language Model Forecasters

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such…

Machine Learning · Computer Science 2025-06-03 Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr

Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical…

Computation and Language · Computer Science 2025-04-09 Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

The recent popularity of large language models (LLMs) has brought a significant impact to boundless fields, particularly through their open-ended ecosystem such as the APIs, open-sourced models, and plugins. However, with their widespread…

Machine Learning · Computer Science 2023-08-31 Wentao Ye, Mingfeng Ou, Tianyi Li, Yipeng Chen, Xuetao Ma, Yifan Yanggong, Sai Wu, Jie Fu, Gang Chen, Haobo Wang, Junbo Zhao

Recognizing Limits: Investigating Infeasibility in Large Language Models

Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities, leading to incorrect or fabricated responses. This paper addresses the need for…

Computation and Language · Computer Science 2025-08-27 Wenbo Zhang, Zihang Xu, Hengrui Cai

Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies

Large Language Models have gained remarkable interest in industry and academia. The increasing interest in LLMs in academia is also reflected in the number of publications on this topic over the last years. For instance, alone 78 of the…

Software Engineering · Computer Science 2025-11-18 Florian Angermeir, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón C., Daniel Mendez, Tony Gorschek

Evaluating Reasoning Models for Queries with Presuppositions

Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to…

Computation and Language · Computer Science 2026-05-06 Rose Sathyanathan, Kinshuk Vasisht, Danish Pruthi

Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs)…

Computation and Language · Computer Science 2024-12-03 Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, Siheng Chen

Challenges and Contributing Factors in the Utilization of Large Language Models (LLMs)

With the development of large language models (LLMs) like the GPT series, their widespread use across various application scenarios presents a myriad of challenges. This review initially explores the issue of domain specificity, where LLMs…

Computation and Language · Computer Science 2023-10-23 Xiaoliang Chen, Liangbin Li, Le Chang, Yunhe Huang, Yuxuan Zhao, Yuxiao Zhang, Dinuo Li