Related papers: Interactive Benchmarks

Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need…

Computation and Language · Computer Science 2025-08-26 Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi

Interactive Evaluation Requires a Design Science

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit…

Artificial Intelligence · Computer Science 2026-05-19 Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

Reasoning Capabilities of Large Language Models on Dynamic Tasks

Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies: self-reflection, heuristic mutation, and planning across dynamic…

Artificial Intelligence · Computer Science 2025-08-12 Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova

How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while…

Computation and Language · Computer Science 2025-08-07 Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang

Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the…

Computation and Language · Computer Science 2025-09-25 Filippo Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández, Raffaella Bernardi

MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies…

Artificial Intelligence · Computer Science 2025-06-05 Huanqia Cai, Yijun Yang, Winston Hu

When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine…

Human-Computer Interaction · Computer Science 2025-02-26 Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in Argumentation

The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are…

Computation and Language · Computer Science 2023-07-31 Camélia Guerraoui, Paul Reisert, Naoya Inoue, Farjana Sultana Mim, Shoichi Naito, Jungmin Choi, Irfan Robbani, Wenzhi Wang, Kentaro Inui

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast,…

Machine Learning · Computer Science 2025-06-11 Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a…

Computation and Language · Computer Science 2025-06-17 Luanbo Wan, Weizhi Ma

Reasoning Elicitation in Language Models via Counterfactual Feedback

Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first…

Computation and Language · Computer Science 2025-03-18 Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González

MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models

Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-04 Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, Eng Siong Chng

MastermindEval: A Simple But Scalable Reasoning Benchmark

Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks. As a result, increasing attention has been given to assessing the true reasoning…

Computation and Language · Computer Science 2025-03-14 Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik

Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1

Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to…

Machine Learning · Computer Science 2025-08-15 Petr Spelda, Vit Stritecky

Reasoning about Bounded Reasoning

In experimental applications of bounded-reasoning models, behavior is often summarized by distributions of "levels". We argue that such summaries conflate two conceptually distinct dimensions: a player's type, capturing beliefs about what…

Theoretical Economics · Economics 2026-04-15 Shuige Liu, Gabriel Ziegler

Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models

Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, hence cannot be responsibly…

Artificial Intelligence · Computer Science 2025-05-06 Cor Steging, Silja Renooij, Bart Verheij

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be…

Computation and Language · Computer Science 2025-11-10 Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

Evaluating Language Models' Evaluations of Games

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how…

Computation and Language · Computer Science 2026-05-19 Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

Measuring AI Reasoning: A Guide for Researchers

In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under…

Artificial Intelligence · Computer Science 2026-05-05 Munachiso Samuel Nwadike, Zangir Iklassov, Kareem Ali, Rifo Genadi, Kentaro Inui