Existing paradigms for evaluating reasoning each have distinct limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that Interactive Benchmarks provide a more robust assessment of this dimension of model intelligence and reveal substantial room for improvement in interactive scenarios.
@article{arxiv.2603.04737,
title = {Interactive Benchmarks},
author = {Baoqing Yue and Zihan Zhu and Yutong Han and Brian Fan and Qian Sun and Jichen Feng and Hufei Yang and Yifan Zhang and Mengdi Wang},
journal = {arXiv preprint arXiv:2603.04737},
year = {2026}
}