Related papers: Agent-Diff: Benchmarking LLM Agents on Enterprise …

An Empirical Study of Agent Developer Practices in AI Agent Frameworks

The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions,…

Software Engineering · Computer Science 2025-12-02 Yanlin Wang, Xinyi Xu, Jiachi Chen, Tingting Bi, Wenchao Gu, Zibin Zheng

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges,…

Software Engineering · Computer Science 2025-11-07 Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios…

Artificial Intelligence · Computer Science 2025-05-23 Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing…

Multiagent Systems · Computer Science 2026-05-12 Tao Yu, Hao Wang, Changyu Li, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Yuxuan Zhou, Haopeng Jin, Zhaolu Kang, Jiabing Yang, YiFan Zhang, Xinming Wang, Hongzhu Yi, Zheqi He, Jing-Shu Zheng, Xi Yang, Yan Huang, Liang Wang

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones to efficient and…

Artificial Intelligence · Computer Science 2024-04-10 Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, Carolin Lawrence

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

A Unified Framework for the Evaluation of LLM Agentic Capabilities

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is…

Artificial Intelligence · Computer Science 2026-05-28 Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is…

Artificial Intelligence · Computer Science 2026-04-02 Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan

AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code…

Software Engineering · Computer Science 2026-04-16 Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman

Efficient Agents: Building Effective Agents While Reducing Cost

The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first…

Artificial Intelligence · Computer Science 2025-08-06 Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous,…

Information Retrieval · Computer Science 2025-05-29 Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the…

Computation and Language · Computer Science 2024-08-29 Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to…

Computation and Language · Computer Science 2024-02-20 Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges in observing, analyzing, and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the…

Artificial Intelligence · Computer Science 2025-03-11 Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, Roy Abitbol

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial…

Computation and Language · Computer Science 2024-12-25 Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset

Advancements in Large Language Models (LLMs) are revolutionizing the development of autonomous agentic systems by enabling dynamic, context-aware task decomposition and automated tool selection. These sophisticated systems possess…

Artificial Intelligence · Computer Science 2024-10-31 Adrian Garret Gabriel, Alaa Alameer Ahmad, Shankar Kumar Jeyakumar

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current…

Artificial Intelligence · Computer Science 2026-05-27 Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su