Related papers: General Agent Evaluation

Benchmark Test-Time Scaling of General LLM Agents

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating…

Artificial Intelligence · Computer Science 2026-02-24 Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic…

Artificial Intelligence · Computer Science 2025-10-15 Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks

Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on…

Software Engineering · Computer Science 2025-11-04 Zhuowen Yin, Cuifeng Gao, Chunsong Fan, Wenzhang Yang, Yinxing Xue, Lijun Zhang

Architectural Design Decisions in AI Agent Harnesses

AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding…

Artificial Intelligence · Computer Science 2026-04-21 Hu Wei

Code as Agent Harness

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a…

Computation and Language · Computer Science 2026-05-19 Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent…

Artificial Intelligence · Computer Science 2026-05-12 Haonan Dong, Qiguan Feng, Kehan Jiang, Haoran Ye, Xin Zhang, Guojie Song

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that…

Artificial Intelligence · Computer Science 2026-05-28 Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

OAgents: An Empirical Study of Building Effective Agents

Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a…

Artificial Intelligence · Computer Science 2025-06-24 He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development,…

Artificial Intelligence · Computer Science 2025-09-29 Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen

AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address…

Artificial Intelligence · Computer Science 2026-01-07 Tara Bogavelli, Roshnee Sharma, Hari Subramani

Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges,…

Software Engineering · Computer Science 2025-11-07 Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang

Towards a Science of Scaling Agent Systems

Agents, language model-based systems capable of reasoning, planning, and acting, are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce…

Artificial Intelligence · Computer Science 2026-04-10 Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu

AI Agent Systems: Architectures, Applications, and Evaluation

AI agents -- systems that combine foundation models with reasoning, planning, memory, and tool use -- are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the…

Artificial Intelligence · Computer Science 2026-01-06 Bin Xu

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the…

Artificial Intelligence · Computer Science 2026-05-26 Shangding Gu

Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis

Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information,…

Artificial Intelligence · Computer Science 2026-02-04 Abdelghny Orogat, Ana Rostam, Essam Mansour

JoyAgent-JDGenie: Technical Report on the GAIA

Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist…

Computation and Language · Computer Science 2025-10-02 Jiarun Liu, Shiyue Xu, Shangkun Liu, Yang Li, Wen Liu, Min Liu, Xiaoqing Zhou, Hanmin Wang, Shilin Jia, zhen Wang, Shaohua Tian, Hanhao Li, Junbo Zhang, Yongli Yu, Peng Cao, Haofen Wang

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental…

Artificial Intelligence · Computer Science 2026-05-21 Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces; however, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack…

Software Engineering · Computer Science 2026-02-27 Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam