English
Related papers

Related papers: Forecasting Frontier Language Model Agent Capabili…

200 papers

The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models…

Artificial Intelligence · Computer Science 2026-01-15 Logan Ritchie , Sushant Mehta , Nick Heiner , Mason Yu , Edwin Chen

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present…

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and…

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce…

Artificial Intelligence · Computer Science 2026-03-17 Shengda Fan , Xuyan Ye , Yupeng Huo , Zhi-Yuan Chen , Yiju Guo , Shenzhi Yang , Wenkai Yang , Shuqi Ye , Jingwen Chen , Haotian Chen , Xin Cong , Yankai Lin

Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge…

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang , Florian E. Dorner , Moritz Hardt

Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored…

Software Engineering · Computer Science 2025-10-17 Bin Liu , Yanjie Zhao , Guoai Xu , Haoyu Wang

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM…

Machine Learning · Computer Science 2024-02-29 Danny Halawi , Fred Zhang , Chen Yueh-Han , Jacob Steinhardt

Large language models (LLM) are perceived to offer promising potentials for automating security tasks, such as those found in security operation centers (SOCs). As a first step towards evaluating this perceived potential, we investigate the…

Cryptography and Security · Computer Science 2024-02-01 Kumar Shashwat , Francis Hahn , Xinming Ou , Dmitry Goldgof , Lawrence Hall , Jay Ligatti , S. Raj Rajgopalan , Armin Ziaie Tabari

Modern Large Language Model (LLM) agents promise end to end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill…

Software Engineering · Computer Science 2025-07-15 Avi Arora , Jinu Jang , Roshanak Zilouchian Moghaddam

LLM based agents have recently demonstrated strong potential in automating complex tasks, yet accurately predicting startup success remains an open challenge with few benchmarks and tailored frameworks. To address these limitations, we…

Artificial Intelligence · Computer Science 2025-04-22 Xisen Wang , Yigit Ihlamur , Fuat Alican

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards , Yukyung Lee , Yujun Audrey Mao , Yulu Qin , Sebastian Schuster , Najoung Kim

Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have…

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce…

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an…

Artificial Intelligence · Computer Science 2026-05-21 Zhengkang Guo , Yiyang Li , Lin Qiu , Xiaohua Wang , Jingwen Xv , Dongyu Ru , Xiaoyu Li , Xiaoqing Zheng , Xuezhi Cao , Xunliang Cai

Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a…

Machine Learning · Computer Science 2025-03-03 Ezra Karger , Houtan Bastani , Chen Yueh-Han , Zachary Jacobs , Danny Halawi , Fred Zhang , Philip E. Tetlock

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to…

Artificial Intelligence · Computer Science 2026-04-28 Andrey Fradkin , Rohit Krishnan

Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while the further evolvement is limited to the lack of high-quality training data. In addition, traditional training approaches rely too much on…

Computation and Language · Computer Science 2025-02-14 Peidong Wang , Ming Wang , Zhiming Ma , Xiaocui Yang , Shi Feng , Daling Wang , Yifei Zhang , Kaisong Song

Recent advancements in Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world. LLMs are often deployed in agent systems: code that orchestrates LLM calls and provides them with tools. We…

Artificial Intelligence · Computer Science 2025-05-20 Maxime Robeyns , Martin Szummer , Laurence Aitchison
‹ Prev 1 2 3 10 Next ›