Related papers: Changing Answer Order Can Decrease MMLU Accuracy

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly)…

Computation and Language · Computer Science 2024-07-04 Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing…

Computation and Language · Computer Science 2023-08-23 Pouya Pezeshkpour, Estevam Hruschka

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Large Language Models (LLMs) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

Hearing the Order: Investigating Position Bias in Large Audio-Language Models

Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of position…

Sound · Computer Science 2026-02-25 Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale…

Computation and Language · Computer Science 2025-09-15 Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy…

Computation and Language · Computer Science 2026-04-29 Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong

Benchmarking Large Language Model Uncertainty for Prompt Optimization

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer,…

Machine Learning · Computer Science 2024-12-30 Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the…

Computation and Language · Computer Science 2025-05-12 Bryan Guan, Tanya Roosta, Peyman Passban, Mehdi Rezagholizadeh

Robustness assessment of large audio language models in multiple-choice evaluation

Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially…

Computation and Language · Computer Science 2025-10-07 Fernando López, Santosh Kesiraju, Jordi Luque

Asking Again and Again: Exploring LLM Robustness to Repeated Questions

This study investigates whether repeating questions within prompts influences the performance of large language models (LLMs). We hypothesize that reiterating a question within a single prompt might enhance the model's focus on key elements…

Computation and Language · Computer Science 2025-03-13 Sagi Shaier, Mario Sanz-Guerrero, Katharina von der Wense

Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) utilize multimodal contexts consisting of text, images, or videos to solve various multimodal tasks. However, we find that changing the order of multimodal input can cause the model's performance to…

Artificial Intelligence · Computer Science 2024-10-23 Zhijie Tan, Xu Chu, Weiping Li, Tong Mo

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on…

Computation and Language · Computer Science 2025-10-06 Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang

Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment

The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop…

Computation and Language · Computer Science 2024-02-22 Philippe Laban, Lidiya Murakhovs'ka, Caiming Xiong, Chien-Sheng Wu

Selective Adversarial Attacks on LLM Benchmarks

Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized…

Machine Learning · Computer Science 2025-10-16 Ivan Dubrovsky, Anastasia Orlova, Illarion Iov, Nina Gubina, Irena Gureeva, Alexey Zaytsev

Multimodal Large Language Models as Image Classifiers

Multimodal Large Language Models (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.…

Computation and Language · Computer Science 2024-06-07 Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

In this paper, we propose a "Generalization Stress Test" to assess Large Language Models' (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We…

Computation and Language · Computer Science 2025-09-23 Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang

Mind Your Tone: Does Tone Alter LLM Performance?

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM…

Artificial Intelligence · Computer Science 2026-05-29 Om Dobariya, Akhil Kumar

ABCD: All Biases Come Disguised

Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit…

Computation and Language · Computer Science 2026-02-20 Mateusz Nowak, Xavier Cadet, Peter Chin

Are We Done with MMLU?

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true…

Computation and Language · Computer Science 2025-01-13 Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini