Related papers: PredictaBoard: Benchmarking LLM Score Predictabili…

200 papers

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial…

Computation and Language · Computer Science · 2024-12-25 · Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex…

Computation and Language · Computer Science · 2025-05-26 · Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi

While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically…

Artificial Intelligence · Computer Science · 2025-11-21 · Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of…

Artificial Intelligence · Computer Science · 2026-02-10 · Tobias Lorenz, Mario Fritz

To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard…

The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in…

Computation and Language · Computer Science · 2025-10-17 · Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia

Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs' instruction-following…

Artificial Intelligence · Computer Science · 2025-03-31 · Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science · 2025-06-04 · Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Large Language Models (LLMs) changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers that want…

Machine Learning · Computer Science · 2025-08-26 · Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco

The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output…

Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining…

Computation and Language · Computer Science · 2025-11-13 · Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Minda Hu, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users' day-to-day experience, because…

Artificial Intelligence · Computer Science · 2026-02-05 · Yiliang Song, Hongjun An, Jiangong Xiao, Haofei Zhao, Jiawei Shao, Xuelong Li

Large language model (LLM) benchmarks inform LLM use decisions (e.g., "is this LLM safe to deploy for my use case and context?"). However, benchmarks may be rendered unreliable by various failure modes that impact benchmark bias, variance,…

We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM…

Computation and Language · Computer Science · 2023-11-01 · Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia

Large Language Models (LLMs) have exhibited great performance in autonomously calling various tools in external environments, leading to better problem solving and task automation capabilities. However, these external tools also amplify…

Cryptography and Security · Computer Science · 2025-09-10 · Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang

Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential…

Computation and Language · Computer Science · 2024-11-27 · Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of…

Artificial Intelligence · Computer Science · 2025-08-19 · Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, Lizi Liao

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to…

Computation and Language · Computer Science · 2026-03-17 · Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann

LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its…

Computation and Language · Computer Science · 2025-09-24 · Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more…

Artificial Intelligence · Computer Science · 2025-04-08 · Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica