Related papers: Easy Problems That LLMs Get Wrong

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them…

Computation and Language · Computer Science 2024-10-04 Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Evaluating Large Language Models for Real-World Engineering Tasks

Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases,…

Artificial Intelligence · Computer Science 2025-05-21 Rene Heesch, Sebastian Eilermann, Alexander Windmann, Alexander Diedrich, Philipp Rosenthal, Oliver Niggemann

Methods for Estimating and Improving Robustness of Language Models

Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over full semantic complexity of the problem. This proposal investigates a…

Computation and Language · Computer Science 2022-06-20 Michal Štefánik

A Primer on Large Language Models and their Limitations

This paper provides a primer on Large Language Models (LLMs) and identifies their strengths, limitations, applications and research directions. It is intended to be useful to those in academia and industry who are interested in gaining an…

Computation and Language · Computer Science 2024-12-09 Sandra Johnson, David Hyland-Wood

A Survey on Large Language Models for Automated Planning

The planning ability of Large Language Models (LLMs) has garnered increasing attention in recent years due to their remarkable capacity for multi-step reasoning and their ability to generalize across a wide range of domains. While some…

Artificial Intelligence · Computer Science 2025-02-19 Mohamed Aghzal, Erion Plaku, Gregory J. Stein, Ziyu Yao

Case Study: Testing Model Capabilities in Some Reasoning Tasks

Large Language Models (LLMs) excel in generating personalized content and facilitating interactive dialogues, showcasing their remarkable aptitude for a myriad of applications. However, their capabilities in reasoning and providing…

Computation and Language · Computer Science 2024-02-16 Min Zhang, Sato Takumi, Jack Zhang, Jun Wang

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary…

Artificial Intelligence · Computer Science 2025-05-15 Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge

Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

A Reality check of the benefits of LLM in business

Large language models (LLMs) have achieved remarkable performance in language understanding and generation tasks by leveraging vast amounts of online texts. Unlike conventional models, LLMs can adapt to new domains through prompt…

Artificial Intelligence · Computer Science 2024-06-18 Ming Cheung

Pitfalls in Evaluating Language Model Forecasters

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such…

Machine Learning · Computer Science 2025-06-03 Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr

Large Language Models for Multi-Robot Systems: A Survey

The rapid advancement of Large Language Models (LLMs) has opened new possibilities in Multi-Robot Systems (MRS), enabling enhanced communication, task allocation and planning, and human-robot interaction. Unlike traditional single-robot and…

Robotics · Computer Science 2026-05-05 Peihan Li, Zijian An, Shams Abrar, Lifeng Zhou

Benchmarking Linguistic Diversity of Large Language Models

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether…

Computation and Language · Computer Science 2025-07-29 Yanzhu Guo, Guokan Shang, Chloé Clavel

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs…

Computation and Language · Computer Science 2025-04-29 Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang

Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey

Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across…

Machine Learning · Computer Science 2025-05-07 Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, Huajun Chen

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic…

Artificial Intelligence · Computer Science 2026-05-05 Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating…

Artificial Intelligence · Computer Science 2023-02-15 Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, Subbarao Kambhampati

Large Language Models and Mathematical Reasoning Failures

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze…

Artificial Intelligence · Computer Science 2025-02-24 Johan Boye, Birger Moell

Frontier LLMs Still Struggle with Simple Reasoning Tasks

While state-of-the-art large language models (LLMs) demonstrate advanced reasoning capabilities, achieving remarkable performance on challenging competitive math and coding benchmarks, they also frequently fail on tasks that are easy for…

Computation and Language · Computer Science 2025-07-11 Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, Csaba Szepesvári

Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results

Large language models (LLM) have revolutionized the processing of natural language. Although first benchmarks of the process modeling abilities of LLM are promising, it is currently under debate to what extent an LLM can generate good…

Computation and Language · Computer Science 2025-03-19 Peter Fettke, Constantin Houy

Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation…

Computation and Language · Computer Science 2025-05-22 Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig