English
Related papers

Related papers: SetupBench: Assessing Software Engineering Agents'…

200 papers

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present…

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing…

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards , Yukyung Lee , Yujun Audrey Mao , Yulu Qin , Sebastian Schuster , Najoung Kim

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web,…

Computation and Language · Computer Science 2024-10-22 Ori Yoran , Samuel Joseph Amouyal , Chaitanya Malaviya , Ben Bogin , Ofir Press , Jonathan Berant

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings…

With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly…

Cryptography and Security · Computer Science 2025-11-03 Sheng Yin , Xianghe Pang , Yuanzhuo Ding , Menglan Chen , Yutong Bi , Yichen Xiong , Wenhao Huang , Zhen Xiang , Jing Shao , Siheng Chen

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges,…

Software Engineering · Computer Science 2025-11-07 Zhengran Zeng , Yixin Li , Rui Xie , Wei Ye , Shikun Zhang

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To…

Computation and Language · Computer Science 2025-10-20 Wei He , Yueqing Sun , Hongyan Hao , Xueyuan Hao , Zhikang Xia , Qi Gu , Chengcheng Han , Dengchang Zhao , Hui Su , Kefeng Zhang , Man Gao , Xi Su , Xiaodong Cai , Xunliang Cai , Yu Yang , Yunke Zhao

LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are…

Artificial Intelligence · Computer Science 2025-10-27 Alfin Wijaya Rahardja , Junwei Liu , Weitong Chen , Zhenpeng Chen , Yiling Lou

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam , Spandan Garg , Jinu Jang , Neel Sundaresan , Roshanak Zilouchian Moghaddam

Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents…

Artificial Intelligence · Computer Science 2025-07-16 Yinsheng Li , Zhen Dong , Yi Shao

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…

Artificial Intelligence · Computer Science 2025-05-13 Kai Xu , YiWei Mao , XinYi Guan , ZiLong Feng

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce…

Artificial Intelligence · Computer Science 2026-03-17 Shengda Fan , Xuyan Ye , Yupeng Huo , Zhi-Yuan Chen , Yiju Guo , Shenzhi Yang , Wenkai Yang , Shuqi Ye , Jingwen Chen , Haotian Chen , Xin Cong , Yankai Lin

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the…

Computation and Language · Computer Science 2024-08-29 Wei Wang , Dan Zhang , Tao Feng , Boyan Wang , Jie Tang

Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic…

Machine Learning · Computer Science 2025-10-23 Hwiwon Lee , Ziqi Zhang , Hanxiao Lu , Lingming Zhang

System models, a critical artifact in software development, provide a formal abstraction of both the structural and behavioral aspects of software systems, which can facilitate the early requirements analysis and architecture design.…

Software Engineering · Computer Science 2025-08-06 Dongming Jin , Zhi Jin , Linyu Li , Zheng Fang , Jia Li , Xiaohong Chen

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to…

‹ Prev 1 2 3 10 Next ›