Related papers: Agent-Diff: Benchmarking LLM Agents on Enterprise …

200 papers

The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions,…

Software Engineering · Computer Science 2025-12-02 Yanlin Wang, Xinyi Xu, Jiachi Chen, Tingting Bi, Wenchao Gu, Zibin Zheng

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges,…

Software Engineering · Computer Science 2025-11-07 Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios…

Artificial Intelligence · Computer Science 2025-05-23 Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing…

The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones to efficient and…

Artificial Intelligence · Computer Science 2024-04-10 Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, Carolin Lawrence

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is…

Artificial Intelligence · Computer Science 2026-05-28 Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is…

Artificial Intelligence · Computer Science 2026-04-02 Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan

Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code…

Software Engineering · Computer Science 2026-04-16 Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman

The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first…

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous,…

Information Retrieval · Computer Science 2025-05-29 Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the…

Computation and Language · Computer Science 2024-08-29 Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to…

Computation and Language · Computer Science 2024-02-20 Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges with observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the…

Artificial Intelligence · Computer Science 2025-03-11 Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, Roy Abitbol

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial…

Computation and Language · Computer Science 2024-12-25 Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Advancements in Large Language Models (LLMs) are revolutionizing the development of autonomous agentic systems by enabling dynamic, context-aware task decomposition and automated tool selection. These sophisticated systems possess…

Artificial Intelligence · Computer Science 2024-10-31 Adrian Garret Gabriel, Alaa Alameer Ahmad, Shankar Kumar Jeyakumar

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current…

Artificial Intelligence · Computer Science 2026-05-27 Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su