General Agent Evaluation

Elron Bandel; Asaf Yehudai; Lilach Eden; Yehoshua Sagron; Yotam Perlitz; Elad Venezian; Natalia Razinkov; Natan Ergas; Shlomit Shachor Ifergan; Segev Shlomov; Michal Jacovi; Leshem Choshen; Liat Ein-Dor; Yoav Katz; Michal Shmueli-Scheuer

Artificial Intelligence 2026-05-12 v2

Abstract

General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures × 5 backbone LLMs (three closed-source, two open-weight) × 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per-domain customization; (ii) agent architecture choice swings results by up to 12 percentage points within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily-customized domain-specific agents; (iv) open-weight models tested exhibit "generality sinks" absent from frontier closed-source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture-distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.
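To make the size of the full factorial design concrete, the following minimal sketch enumerates every (architecture, backbone, benchmark) cell. All labels are hypothetical placeholders, since the abstract does not list the individual architectures, models, or benchmarks by name; only the counts (5, 5, and 6, with three closed-source and two open-weight backbones) come from the text.

```python
from itertools import product

# Placeholder labels (assumptions): only the counts come from the abstract.
ARCHITECTURES = [f"architecture_{i}" for i in range(1, 6)]   # 5 agent architectures
BACKBONES = (
    [f"closed_model_{i}" for i in range(1, 4)]               # 3 closed-source LLMs
    + [f"open_model_{i}" for i in range(1, 3)]               # 2 open-weight LLMs
)
BENCHMARKS = [f"benchmark_{i}" for i in range(1, 7)]         # 6 benchmarks


def factorial_grid(architectures, backbones, benchmarks):
    """Return every (architecture, backbone, benchmark) combination."""
    return list(product(architectures, backbones, benchmarks))


grid = factorial_grid(ARCHITECTURES, BACKBONES, BENCHMARKS)

# An "agent configuration" here means one (architecture, backbone) pair.
agent_configs = sorted({(arch, model) for arch, model, _ in grid})

print(len(agent_configs))  # 25 agent configurations
print(len(grid))           # 150 evaluation cells
```

Under these assumptions the design yields 25 agent configurations, each evaluated on 6 benchmarks, for 150 cells in total.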

Keywords

benchmarking; autonomous agents; multi-agent systems

Cite

@article{arxiv.2602.22953,
  title  = {General Agent Evaluation},
  author = {Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
  journal= {arXiv preprint arXiv:2602.22953},
  year   = {2026}
}

Comments

Presented at the ICLR 2026 Workshop on Agents in the Wild
