Related papers: Interactive Benchmarks

200 papers

Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need…

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit…

Artificial Intelligence · Computer Science 2026-05-19 Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies: self-reflection, heuristic mutation, and planning across dynamic…

Artificial Intelligence · Computer Science 2025-08-12 Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova

As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while…

Computation and Language · Computer Science 2025-08-07 Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the…

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies…

Artificial Intelligence · Computer Science 2025-06-05 Huanqia Cai, Yijun Yang, Winston Hu

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine…

Human-Computer Interaction · Computer Science 2025-02-26 Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are…

Computation and Language · Computer Science 2023-07-31 Camélia Guerraoui, Paul Reisert, Naoya Inoue, Farjana Sultana Mim, Shoichi Naito, Jungmin Choi, Irfan Robbani, Wenzhi Wang, Kentaro Inui

While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast,…

Machine Learning · Computer Science 2025-06-11 Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a…

Computation and Language · Computer Science 2025-06-17 Luanbo Wan, Weizhi Ma

Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first…

Computation and Language · Computer Science 2025-03-18 Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González

Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, Eng Siong Chng

Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks. As a result, increasing attention has been given to assessing the true reasoning…

Computation and Language · Computer Science 2025-03-14 Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik

Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to…

Machine Learning · Computer Science 2025-08-15 Petr Spelda, Vit Stritecky

In experimental applications of bounded-reasoning models, behavior is often summarized by distributions of "levels". We argue that such summaries conflate two conceptually distinct dimensions: a player's type, capturing beliefs about what…

Theoretical Economics · Economics 2026-04-15 Shuige Liu, Gabriel Ziegler

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, hence cannot be responsibly…

Artificial Intelligence · Computer Science 2025-05-06 Cor Steging, Silja Renooij, Bart Verheij

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be…

Computation and Language · Computer Science 2025-11-10 Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how…

In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under…

Artificial Intelligence · Computer Science 2026-05-05 Munachiso Samuel Nwadike, Zangir Iklassov, Kareem Ali, Rifo Genadi, Kentaro Inui