Computation and Language · Computer Science
LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments
Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang +3
2024-02-27
Artificial Intelligence · Computer Science
PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models
Wenlong Shi, Jianxun Lian, Mingqi Wu, Haiming Qin +4
2026-05-19
Artificial Intelligence · Computer Science
MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49
2026-05-29
Machine Learning · Computer Science
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang +4
2026-04-24
Computation and Language · Computer Science
TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation
Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson +2
2024-02-09
Artificial Intelligence · Computer Science
GameArena: Evaluating LLM Reasoning through Live Computer Games
Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang +3
2025-02-18
Artificial Intelligence · Computer Science
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou +8
2024-04-17
Artificial Intelligence · Computer Science
TextQuests: How Good are LLMs at Text-Based Video Games?
Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
2025-08-15
Multiagent Systems · Computer Science
Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence
Yuhang Song, Andrzej Wojcicki, Thomas Lukasiewicz, Jianyi Wang +5
2019-12-02
Artificial Intelligence · Computer Science
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature
Daoyu Wang, Mingyue Cheng, Shuo Yu, Zirui Liu +3
2026-02-02
Computation and Language · Computer Science
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu +3
2025-05-29
Artificial Intelligence · Computer Science
Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play
Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev
2025-08-19
Machine Learning · Computer Science
TextWorld: A Learning Environment for Text-based Games
Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas +9
2019-11-11
Computation and Language · Computer Science
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu +15
2026-02-06
Computation and Language · Computer Science
Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback
Shijing Zhu, Zhuang Chen, Guanqun Bi, Binghang Li +9
2025-05-07
Machine Learning · Computer Science
Large Language Models Can Self-Improve At Web Agent Tasks
Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu +2
2024-10-03
Computation and Language · Computer Science
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie +5
2025-10-10
Artificial Intelligence · Computer Science
A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella +5
2025-11-05
Artificial Intelligence · Computer Science
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia +5
2025-02-07
Computation and Language · Computer Science
TextAtari: 100K Frames Game Playing with Language Agents
Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng +7
2025-06-11
Computation and Language · Computer Science
LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts
Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin +3
2025-09-23
Artificial Intelligence · Computer Science
Understanding the Weakness of Large Language Model Agents within a Complex Android Environment
Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen +2
2024-02-12
Software Engineering · Computer Science
CodeArena: A Collective Evaluation Platform for LLM Code Generation
Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu +4
2025-03-04
Machine Learning · Computer Science
ClawArena: Benchmarking AI Agents in Evolving Information Environments
Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia +8
2026-05-19