Related papers: Forecasting Frontier Language Model Agent Capabili…

The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments

The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models…

Artificial Intelligence · Computer Science 2026-01-15 Logan Ritchie, Sushant Mehta, Nick Heiner, Mason Yu, Edwin Chen

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and…

Artificial Intelligence · Computer Science 2024-07-18 Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce…

Artificial Intelligence · Computer Science 2026-03-17 Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin

BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge…

Quantitative Methods · Quantitative Biology 2025-10-10 Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

How Benchmark Prediction from Fewer Data Misses the Mark

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang, Florian E. Dorner, Moritz Hardt

LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored…

Software Engineering · Computer Science 2025-10-17 Bin Liu, Yanjie Zhao, Guoai Xu, Haoyu Wang

Approaching Human-Level Forecasting with Language Models

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM…

Machine Learning · Computer Science 2024-02-29 Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

A Preliminary Study on Using Large Language Models in Software Pentesting

Large language models (LLM) are perceived to offer promising potentials for automating security tasks, such as those found in security operation centers (SOCs). As a first step towards evaluating this perceived potential, we investigate the…

Cryptography and Security · Computer Science 2024-02-01 Kumar Shashwat, Francis Hahn, Xinming Ou, Dmitry Goldgof, Lawrence Hall, Jay Ligatti, S. Raj Rajgopalan, Armin Ziaie Tabari

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

Modern Large Language Model (LLM) agents promise end to end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill…

Software Engineering · Computer Science 2025-07-15 Avi Arora, Jinu Jang, Roshanak Zilouchian Moghaddam

SSFF: Investigating LLM Predictive Capabilities for Startup Success through a Multi-Agent Framework with Enhanced Explainability and Performance

LLM based agents have recently demonstrated strong potential in automating complex tasks, yet accurately predicting startup success remains an open challenge with few benchmarks and tailored frameworks. To address these limitations, we…

Artificial Intelligence · Computer Science 2025-04-22 Xisen Wang, Yigit Ihlamur, Fuat Alican

RExBench: Can coding agents autonomously implement AI research extensions?

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

Predicting Language Models' Success at Zero-Shot Probabilistic Prediction

Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have…

Machine Learning · Computer Science 2025-09-22 Kevin Ren, Santiago Cortes-Gomez, Carlos Miguel Patiño, Ananya Joshi, Ruiqi Lyu, Jingjing Tang, Alistair Turcan, Khurram Yamin, Steven Wu, Bryan Wilder

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce…

Artificial Intelligence · Computer Science 2025-11-04 Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an…

Artificial Intelligence · Computer Science 2026-05-21 Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a…

Machine Learning · Computer Science 2025-03-03 Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock

MarketBench: Evaluating AI Agents as Market Participants

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to…

Artificial Intelligence · Computer Science 2026-04-28 Andrey Fradkin, Rohit Krishnan

Language Models as Continuous Self-Evolving Data Engineers

Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while the further evolvement is limited to the lack of high-quality training data. In addition, traditional training approaches rely too much on…

Computation and Language · Computer Science 2025-02-14 Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song

A Self-Improving Coding Agent

Recent advancements in Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world. LLMs are often deployed in agent systems: code that orchestrates LLM calls and provides them with tools. We…

Artificial Intelligence · Computer Science 2025-05-20 Maxime Robeyns, Martin Szummer, Laurence Aitchison