Related papers: Changing Answer Order Can Decrease MMLU Accuracy

200 papers

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly)…

Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing…

Computation and Language · Computer Science 2023-08-23 Pouya Pezeshkpour, Estevam Hruschka

Large Language Models (LLMs) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of position…

Sound · Computer Science 2026-02-25 Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee

NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale…

Computation and Language · Computer Science 2025-09-15 Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy…

Computation and Language · Computer Science 2026-04-29 Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer,…

Machine Learning · Computer Science 2024-12-30 Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the…

Computation and Language · Computer Science 2025-05-12 Bryan Guan, Tanya Roosta, Peyman Passban, Mehdi Rezagholizadeh

Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially…

Computation and Language · Computer Science 2025-10-07 Fernando López, Santosh Kesiraju, Jordi Luque

This study investigates whether repeating questions within prompts influences the performance of large language models (LLMs). We hypothesize that reiterating a question within a single prompt might enhance the model's focus on key elements…

Computation and Language · Computer Science 2025-03-13 Sagi Shaier, Mario Sanz-Guerrero, Katharina von der Wense

Multimodal Large Language Models (MLLMs) utilize multimodal contexts consisting of text, images, or videos to solve various multimodal tasks. However, we find that changing the order of multimodal input can cause the model's performance to…

Artificial Intelligence · Computer Science 2024-10-23 Zhijie Tan, Xu Chu, Weiping Li, Tong Mo

Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on…

Computation and Language · Computer Science 2025-10-06 Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang

The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop…

Computation and Language · Computer Science 2024-02-22 Philippe Laban, Lidiya Murakhovs'ka, Caiming Xiong, Chien-Sheng Wu

Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized…

Machine Learning · Computer Science 2025-10-16 Ivan Dubrovsky, Anastasia Orlova, Illarion Iov, Nina Gubina, Irena Gureeva, Alexey Zaytsev

Multimodal Large Language Models (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.…

Computation and Language · Computer Science 2024-06-07 Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

In this paper, we propose a "Generalization Stress Test" to assess Large Language Models' (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We…

Computation and Language · Computer Science 2025-09-23 Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM…

Artificial Intelligence · Computer Science 2026-05-29 Om Dobariya, Akhil Kumar

Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit…

Computation and Language · Computer Science 2026-02-20 Mateusz Nowak, Xavier Cadet, Peter Chin

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true…
