Related papers: Benchmarking Agentic Workflow Generation

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving…

Machine Learning · Computer Science 2026-05-22 Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination…

Multiagent Systems · Computer Science 2025-03-05 Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

Recent advancements in large language models (LLMs) have driven a revolutionary paradigm shift in process automation from Robotic Process Automation to Agentic Process Automation by automating the workflow orchestration procedure based on…

Software Engineering · Computer Science 2024-11-11 Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun

SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where…

Software Engineering · Computer Science 2026-04-15 Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing…

Multiagent Systems · Computer Science 2026-05-12 Tao Yu, Hao Wang, Changyu Li, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Yuxuan Zhou, Haopeng Jin, Zhaolu Kang, Jiabing Yang, YiFan Zhang, Xinming Wang, Hongzhu Yi, Zheqi He, Jing-Shu Zheng, Xi Yang, Yan Huang, Liang Wang

AFlow: Automating Agentic Workflow Generation

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing…

Artificial Intelligence · Computer Science 2025-04-16 Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu

GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with…

Artificial Intelligence · Computer Science 2025-04-17 Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan

TaskBench: Benchmarking Large Language Models for Task Automation

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute…

Computation and Language · Computer Science 2024-11-04 Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

WritingBench: A Comprehensive Benchmark for Generative Writing

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text…

Artificial Intelligence · Computer Science 2025-12-01 Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang

LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs…

Computation and Language · Computer Science 2025-03-10 Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin

AutoFlow: Automated Workflow Generation for Large Language Model Agents

Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLM is LLM-based AI Agent, which leverages the ability of LLM as well as external…

Computation and Language · Computer Science 2024-07-19 Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang

WorkflowGen: an adaptive workflow generation mechanism driven by trajectory experience

Large language model (LLM) agents often suffer from high reasoning overhead, excessive token consumption, unstable execution, and inability to reuse past experiences in complex tasks like business queries, tool use, and workflow…

Machine Learning · Computer Science 2026-04-23 Ruocan Wei, Shufeng Wang, Ziwei Shi

FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents

LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for…

Computation and Language · Computer Science 2024-06-24 Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, Yongbin Li

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

Large Language Models (LLMs) have become central to automated code generation, yet existing approaches operate within a single-LLM paradigm: one model is selected and applied throughout the entire generation process. We observe that…

Software Engineering · Computer Science 2026-04-21 Huashan Chen, Zhenyu Qi, Haotang Li, Hong Chen, Jinfu Chen, Kebin Peng, In Kee Kim, Kyu Hyung Lee, Sen He, Weiyi Shang

ArchBench: Benchmarking Generative-AI for Software Architecture Tasks

Benchmarks for large language models (LLMs) have progressed from snippet-level function generation to repository-level issue resolution, yet they overwhelmingly target implementation correctness. Software architecture tasks remain…

Software Engineering · Computer Science 2026-03-19 Bassam Adnan, Aviral Gupta, Sreemaee Akshathala, Karthik Vaidhyanathan

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing…

Artificial Intelligence · Computer Science 2026-05-21 Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the…

Computation and Language · Computer Science 2024-08-29 Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang