Related papers: Chat2Workflow: A Benchmark for Generating Executab…

WfBench: Automated Generation of Scientific Workflow Benchmarks

The prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class high-performance computing (HPC) clusters. To handle the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-02-10 · Tainã Coleman, Henri Casanova, Ketan Maheshwari, Loïc Pottier, Sean R. Wilkinson, Justin Wozniak, Frédéric Suter, Mallikarjun Shankar, Rafael Ferreira da Silva

EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy…

Computation and Language · Computer Science · 2026-04-08 · Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural…

Computation and Language · Computer Science · 2025-07-29 · Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

Dialog-based Automation of Decision Making in Processes

The use of chatbots has spread, generating great interest in the industry for the possibility of automating tasks within the execution of their processes. The implementation of chatbots, however simple, is a complex endeavor that involves…

Software Engineering · Computer Science · 2021-09-03 · Bedilia Estrada-Torres, Adela del-Río-Ortega, Manuel Resinas

WorkTeam: Constructing Workflows from Natural Language with Multi-Agents

Workflows play a crucial role in enhancing enterprise efficiency by orchestrating complex processes with multiple tools or components. However, hand-crafted workflow construction requires expert knowledge, presenting significant technical…

Computation and Language · Computer Science · 2025-03-31 · Hanchao Liu, Rongjun Li, Weimin Xiong, Ziyu Zhou, Wei Peng

GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps,…

Artificial Intelligence · Computer Science · 2026-05-15 · Drewry H. Morris, Luis Valles, Reza Hosseini Ghomi

DyFlow: Dynamic Workflow Framework for Agentic Reasoning

Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed…

Computation and Language · Computer Science · 2025-10-01 · Yanbo Wang, Zixiang Xu, Yue Huang, Xiangqi Wang, Zirui Song, Lang Gao, Chenxi Wang, Xiangru Tang, Yue Zhao, Arman Cohan, Xiangliang Zhang, Xiuying Chen

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical…

Software Engineering · Computer Science · 2026-04-02 · Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang

TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization

Recent agentic systems demonstrate that large language models can generate scientific visualizations from natural language. However, reliability remains a major limitation: systems may execute invalid operations, introduce subtle but…

Human-Computer Interaction · Computer Science · 2026-03-27 · Nathaniel Gorski, Shusen Liu, Bei Wang

AutoFlow: Automated Workflow Generation for Large Language Model Agents

Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLM is LLM-based AI Agent, which leverages the ability of LLM as well as external…

Computation and Language · Computer Science · 2024-07-19 · Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, Yongfeng Zhang

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents…

Artificial Intelligence · Computer Science · 2026-05-19 · Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science · 2026-02-19 · Haorui Chen, Chengze Li, Jia Li

Benchmarking Agentic Workflow Generation

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a…

Computation and Language · Computer Science · 2025-02-25 · Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation…

Computation and Language · Computer Science · 2026-03-11 · Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined…

Software Engineering · Computer Science · 2026-03-09 · Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo

RobustFlow: Towards Robust Agentic Workflow Generation

The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflow remains a critical,…

Multiagent Systems · Computer Science · 2025-10-07 · Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, Min-Ling Zhang

Comparing Generative Chatbots Based on Process Requirements

Business processes are commonly represented by modelling languages, such as Event-driven Process Chain (EPC), Yet Another Workflow Language (YAWL), and the most popular standard notation for modelling business processes, the Business…

Computation and Language · Computer Science · 2023-12-08 · Luis Fernando Lins, Nathalia Nascimento, Paulo Alencar, Toacy Oliveira, Donald Cowan

VisCoder2: Building Multi-Language Visualization Coding Agents

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable…

Software Engineering · Computer Science · 2026-04-09 · Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

Enhancing Software Development with Context-Aware Conversational Agents: A User Study on Developer Interactions with Chatbots

Software development is a cognitively intensive process requiring multitasking, adherence to evolving workflows, and continuous learning. With the rise of large language model (LLM)-based tools, such as conversational agents (CAs), there is…

Software Engineering · Computer Science · 2025-05-14 · Glaucia Melo, Paulo Alencar, Donald Cowan

Chat2VIS: Generating Data Visualisations via Natural Language using ChatGPT, Codex and GPT-3 Large Language Models

The field of data visualisation has long aimed to devise solutions for generating visualisations directly from natural language text. Research in Natural Language Interfaces (NLIs) has contributed towards the development of such techniques.…

Human-Computer Interaction · Computer Science · 2023-02-14 · Paula Maddigan, Teo Susnjak