Related papers: Benchmarking Data Science Agents

DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI…

Artificial Intelligence · Computer Science 2025-04-14 Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

A Survey on Large Language Model-based Agents for Statistics and Data Science

In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution,…

Artificial Intelligence · Computer Science 2025-12-01 Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang

LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions

Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and…

Artificial Intelligence · Computer Science 2025-10-07 Mizanur Rahman, Amran Bhuiyan, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Ridwan Mahbub, Ahmed Masry, Shafiq Joty, Enamul Hoque

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers,…

Artificial Intelligence · Computer Science 2026-01-21 Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code…

Artificial Intelligence · Computer Science 2025-10-23 Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent…

Artificial Intelligence · Computer Science 2026-05-29 Zhenghao Zhu, Yuanfeng Song, Xin Chen, Chengzhong Liu, Yakun Cui, Caleb Chen Cao, Sirui Han, Yike Guo

Large Language Model-based Data Science Agent: A Survey

The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents…

Artificial Intelligence · Computer Science 2025-11-25 Ke Chen, Peiran Wang, Yaoning Yu, Xianyang Zhan, Haohan Wang

DSBC: Data Science task Benchmarking with Context engineering

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks…

Artificial Intelligence · Computer Science 2025-08-08 Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Ali Shafique, Suman Debnath, Hamza Farooq

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce…

Computation and Language · Computer Science 2025-06-09 Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan

DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for…

Artificial Intelligence · Computer Science 2025-12-09 Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

Large Language Models in the Data Science Lifecycle: A Systematic Mapping Study

In recent years, Large Language Models (LLMs) have emerged as transformative tools across numerous domains, impacting how professionals approach complex analytical tasks. This systematic mapping study comprehensively examines the…

Computers and Society · Computer Science 2025-08-19 Sai Sanjna Chintakunta, Nathalia Nascimento, Everton Guimaraes

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data…

Artificial Intelligence · Computer Science 2025-10-21 Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du

DABstep: Data Agent Benchmark for Multi-step Reasoning

We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi-step data analysis tasks. DABstep comprises over 450 real-world challenges derived from a financial analytics platform, requiring models to combine…

Machine Learning · Computer Science 2025-07-01 Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, Thomas Wolf

DS-STAR: Data Science Agent for Solving Diverse Tasks across Heterogeneous Formats and Open-Ended Queries

While large language models (LLMs) have shown promise in automating data science, existing agents often struggle with the complexity of real-world workflows that require exploring multiple sources and synthesizing open-ended insights. In…

Artificial Intelligence · Computer Science 2026-02-25 Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Raj Sinha, Jinwoo Shin, Tomas Pfister

DCA-Bench: A Benchmark for Dataset Curation Agents

The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms nowadays, data quality issues, such as incomplete…

Artificial Intelligence · Computer Science 2025-05-28 Benhao Huang, Yingzhuo Yu, Jin Huang, Xingjian Zhang, Jiaqi Ma

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs…

Computation and Language · Computer Science 2024-01-31 Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu

DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation

Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We…

Computation and Language · Computer Science 2025-10-06 Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex tasks end-to-end by interacting with an execution…

Computation and Language · Computer Science 2024-03-12 Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu

DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning…

Artificial Intelligence · Computer Science 2025-05-21 He Wang, Alexander Hanbo Li, Yiqun Hu, Sheng Zhang, Hideo Kobayashi, Jiani Zhang, Henry Zhu, Chung-Wei Hang, Patrick Ng

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true…

Computation and Language · Computer Science 2025-04-01 Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun