Interactive Benchmarks

Artificial Intelligence; Computation and Language; Machine Learning | 2026-05-19, v4

Abstract

Existing paradigms for evaluating reasoning each have distinct limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence and reveal substantial room for improvement in interactive scenarios.
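To make the idea of budgeted multi-turn interaction concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy task (guessing a hidden integer) in place of the paper's Logic, UI2Html, and Mathematics tasks, and every name in it (`Judge`, `binary_search_policy`, `run_episode`) is hypothetical. The judge returns objective feedback, and the episode ends when the task is solved or the turn budget is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One turn of interaction: (guess, feedback from the judge).
Turn = Tuple[int, str]


@dataclass
class Judge:
    """Hypothetical judge: holds a hidden integer and answers objectively."""
    secret: int

    def respond(self, guess: int) -> str:
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "correct"


def binary_search_policy(lo: int, hi: int) -> Callable[[List[Turn]], int]:
    """Toy stand-in for a model: narrows the range using the last feedback."""
    state = {"lo": lo, "hi": hi}

    def act(history: List[Turn]) -> int:
        if history:
            guess, feedback = history[-1]
            if feedback == "higher":
                state["lo"] = guess + 1
            elif feedback == "lower":
                state["hi"] = guess - 1
        return (state["lo"] + state["hi"]) // 2

    return act


def run_episode(policy: Callable[[List[Turn]], int],
                judge: Judge,
                budget: int) -> Tuple[bool, int]:
    """Run at most `budget` turns; return (solved, turns_used)."""
    history: List[Turn] = []
    for turn in range(1, budget + 1):
        guess = policy(history)
        feedback = judge.respond(guess)
        history.append((guess, feedback))
        if feedback == "correct":
            return True, turn
    return False, budget


if __name__ == "__main__":
    solved, turns = run_episode(binary_search_policy(1, 100), Judge(secret=37), budget=7)
    print(solved, turns)
```

Under this sketch, a policy that chooses informative queries solves the task within a small budget, while the same policy fails if the budget is cut below the number of turns it needs. That is the sense in which the budget ties the score to what information the model decides to acquire.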

Cite

@article{arxiv.2603.04737,
  title  = {Interactive Benchmarks},
  author = {Baoqing Yue and Zihan Zhu and Yutong Han and Brian Fan and Qian Sun and Jichen Feng and Hufei Yang and Yifan Zhang and Mengdi Wang},
  journal= {arXiv preprint arXiv:2603.04737},
  year   = {2026}
}

Comments

Project Page: https://github.com/interactivebench/interactivebench